A TIME-WARPING FRAMEWORK FOR SPEECH TURBULENCE-NOISE COMPONENT ESTIMATION DURING APERIODIC PHONATION*
Nicolas Malyska and Thomas F. Quatieri
MIT Lincoln Laboratory
{nmalyska, quatieri}@ll.mit.edu

ABSTRACT

The accurate estimation of turbulence noise affects many areas of speech processing including separate modification of the noise component, analysis of degree of speech aspiration for treating pathological voice, the automatic labeling of speech voicing, as well as speaker characterization and recognition. Previous work in the literature has provided methods by which such a high-quality noise component may be estimated in near-periodic speech, but it is known that these methods tend to leak aperiodic phonation (with even slight deviations from periodicity) into the noise-component estimate. In this paper, we improve upon existing algorithms in conditions of aperiodicity by introducing a time-warping based approach to speech noise-component estimation, demonstrating the results on both natural and synthetic speech examples.

Index Terms— Time warping, phonation, aspiration, irregularity, glottal closure

*This work is sponsored by the United States Air Force Research Laboratory under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

978-1-4577-0539-7/11/$26.00 ©2011 IEEE, ICASSP 2011

1. INTRODUCTION

The process of estimating the speech turbulence-noise component is important for many applications, including separate modification of the noise component [1], analysis of the degree of speech aspiration for treating pathological voice, the automatic labeling of speech voicing, and speaker characterization and recognition. A number of algorithms exist for estimation of the turbulence-noise component, such as those by Jackson and Shadle [1], Serra and Smith [2], and d'Alessandro et al. [3].
Although these algorithms perform well in periodic regions, issues arise during aperiodic regions of speech. Deviations from the assumption of periodicity lead to leakage of the phonation component of speech into the noise component. In this paper, we improve upon existing algorithms by introducing a time-warping-based approach to speech noise-component estimation. We describe this process in two parts: (1) a time-warping framework for making aperiodic phonation appear periodic, where we use the Jackson and Shadle algorithm as an example, and (2) experiments demonstrating the effectiveness of the turbulence noise-component separation over conventional approaches with both natural and synthetic data. With the new time-warping framework, we show that significant gains can be achieved in terms of preventing the leakage of irregular pitch periods into the noise component. This work acts as a proof of concept for more sophisticated noise-separation algorithms using time-warping as a front end.

We note that in different contexts, time-warping has been applied as a preprocessor for speech analysis. Wang and Cuperman [4], for example, have described a successful method using dynamic time warping for voicing estimation in a speech-coding application. Another case is speech enhancement using a nonlinear time-warping preprocessor, as reported by Ramalho and Mammone [5]. Additionally, Murphy describes a "perturbation-free" measure of harmonic-to-noise ratio in voice, which averages many time-warped glottal cycles in order to estimate the component not due to noise [6]. Our work builds upon this previous research in that it (1) specifically focuses on aperiodic phonation, where the time-warping function varies rapidly, (2) exploits explicit estimates of the glottal-pulse times to develop the time-warping function, and (3) is applied to the problem of estimating the turbulent-noise component of speech.

2. CONVENTIONAL NOISE SEPARATION

A number of algorithms exist for estimation of the turbulence-noise component. One popular approach, by Jackson and Shadle [1], is based on a pitch-dependent short-time Fourier transform representation. The Jackson-Shadle noise-separation algorithm is referred to as pitch-scaled harmonic filtering (PSHF) [1]. The PSHF method is desirable since there is evidence that it can preserve the temporal-modulation characteristics of the noise component, i.e., noise excitations are typically modulated by the periodic glottal flow wave. The PSHF approach uses an analysis window duration equal to four pitch periods and relies on the property that harmonics of the fundamental frequency fall at specific frequency bins of the short-time Fourier transform, specifically with every fourth discrete Fourier transform (DFT) coefficient being selected [1]. The residual values remaining after DFT selection provide one noise-component spectrum, while a second noise estimate can be obtained by interpolation of the residual across the zeroed DFT bins.
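To make the bin-selection step concrete, the following is a minimal Python sketch of the idea (our own illustration, not the authors' MATLAB implementation). It assumes the analysis frame holds exactly four pitch periods, so that harmonics of the fundamental fall on every fourth DFT bin:

```python
import numpy as np

def pshf_decompose(frame, periods_per_window=4):
    """Split one analysis frame into voiced and noise components by
    selecting the DFT bins where the harmonics fall.

    Assumes len(frame) spans exactly `periods_per_window` pitch
    periods, so harmonics land on bins that are multiples of
    `periods_per_window`."""
    spectrum = np.fft.rfft(frame)
    harmonic = np.zeros_like(spectrum)
    bins = np.arange(0, len(spectrum), periods_per_window)
    harmonic[bins] = spectrum[bins]      # harmonic (voiced) bins
    residual = spectrum - harmonic       # residual = one noise estimate
    voiced = np.fft.irfft(harmonic, n=len(frame))
    noise = np.fft.irfft(residual, n=len(frame))
    return voiced, noise
```

The second, interpolated noise estimate described above would additionally fill the zeroed harmonic bins by interpolating the residual across them; that refinement is omitted from this sketch.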
Although our focus here is the Jackson-Shadle algorithm, the framework described in the following section allows application of other decomposition algorithms. For example, d'Alessandro et al. [3] have proposed a decomposition method that incorporates a cepstral lifter (analogous to a spectral comb filter) to iteratively separate the harmonic and noise components, relying on the DFT coefficients to contain a contribution from both a periodic component and a noise component. As an alternative, an earlier sinewave-based decomposition algorithm was proposed by Serra and Smith [2]. An important difference from that of d'Alessandro et al.
is that the Serra and Smith algorithm does not restrict the deterministic part of the signal to contain harmonically-related frequency components, and hence does not require periodicity, reflecting an interest in analyzing non-harmonic signals such as music as well as speech. Nevertheless, due to the need for sinewave estimation from peak-picking, the algorithm can still benefit from improved stationarity.

3. TIME-WARPING FRAMEWORK FOR NOISE-COMPONENT SEPARATION

This section describes the framework that is used to make aperiodic phonation appear periodic prior to performing turbulence-noise decomposition. Although the decomposition algorithm of focus here is the Jackson and Shadle noise-component separation algorithm [1], the framework is applicable to any decomposition algorithm that requires a short-time stationarity constraint. As depicted in Figure 1, the general framework consists of a time-warping signal preprocessor, an analysis stage, and a preprocessor-compensation stage.

Figure 1. Schematic of noise-component separation performed using the time-warping preprocessor and associated compensation.

The first step in the process is to (optionally) whiten the acoustic signal in order to remove the effects of poles in the vocal-tract response as much as possible. This step is used since time-warping tends to distort the formant frequencies and bandwidths in a signal. In this work, we use an order-12 autocorrelation-linear-prediction-based inverse filter, with analysis and synthesis implemented using the Voicebox MATLAB toolbox [7]. The frame interval and analysis Hamming window for the analysis-synthesis are 10 and 20 ms, respectively.

After whitening the input signal, time-warping is applied in order to transform the irregular glottal pulses into regular glottal pulses. This is implemented using a discrete-time simulation (oversampling by a large factor) of continuous-time warping. There are many valid warping functions that will accomplish this transformation. We use a smoothly-interpolated contour, which warps the irregular glottal pulses into periodic pulses.
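As a simplified illustration of this warping step (our Python sketch, not the paper's implementation: the paper uses a heavily oversampled discrete-time simulation and a smooth contour, while this version substitutes a piecewise-linear map and plain linear interpolation, and assumes the pulse times are already known):

```python
import numpy as np

def warp_to_constant_pitch(x, fs, pulse_times, period=0.008):
    """Resample x so the irregular glottal-pulse instants in
    `pulse_times` (seconds) land on a uniform grid spaced `period`
    seconds apart (8 ms, i.e., 125 Hz).

    A piecewise-linear map sends each output time to its pre-image
    on the original (irregular) time axis; the warped signal is then
    read off by linear interpolation."""
    t_irr = np.asarray(pulse_times, dtype=float)
    t_reg = t_irr[0] + period * np.arange(len(t_irr))
    n_out = int(round((t_reg[-1] - t_reg[0]) * fs))
    t_out = t_reg[0] + np.arange(n_out) / fs    # uniform output times
    t_src = np.interp(t_out, t_reg, t_irr)      # pre-image under the warp
    t_in = np.arange(len(x)) / fs               # original sample times
    return np.interp(t_src, t_in, x)
```

Unwarping in the compensation stage applies the inverse map, i.e., the same construction with the roles of the regular and irregular pulse-time grids exchanged.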
The pulse times used to derive this contour are obtained from either known closure times in synthetic speech or EGG-derived glottal closure times in natural speech. Closure times are extracted from the EGG signal using the SIGMA algorithm [8], implemented in the Voicebox MATLAB toolbox [7]. In the experiments reported in this paper, all periods are warped to reflect a constant pitch of 125 Hz.

Once time-warping has been performed, we apply noise separation. As mentioned, noise separation should be a scheme that can benefit from periodicity, but more generally from the invoking of stationarity. We implemented the Jackson-Shadle algorithm in MATLAB to separate the periodic and noise components of speech signals. An important feature of our framework is that it requires no changes to the implementation of the conventional separation system, as is the case for all three of the algorithms described above. In the preprocessor-compensation stage following analysis, the signals that are output by the analysis stage must be unwarped in time so that they align with the input signals. Finally, to undo the whitening that has been applied to the input acoustic signal, spectral shaping is reapplied to each component.

4. EXPERIMENTS

In this section, we perform turbulent-noise decomposition on synthetic and real speech as a demonstration of the capabilities of the new framework. In particular, we demonstrate that the framework can improve separation of the contribution of irregular pitch periods from the contribution of voice aspiration noise.
In both the time-warping and no-time-warping cases, the separation algorithm is the Jackson and Shadle algorithm [1], implemented without interpolation of the power spectrum. The only difference between the two approaches is the presence of the time-warping (and its companion time-warping-compensation) stage. Spectral whitening and whitening-compensation are applied in both scenarios.

4.1. Synthetic speech

In this section, we show how conventional separation allows major leakage, which is significantly reduced with the new algorithm. For the synthetic cases in this section, we quantify performance by using a signal with irregular pulse spacing and with a known amount of additive noise energy.

Two Case Studies: To focus on the effect of time-warping, we first remove the whitening stage and work with a series of synthetic impulse trains. Furthermore, to isolate the effect of jitter and its role in leaking the voiced component, we do not include noise addition in this initial experiment. Figure 2 compares the results using the Jackson and Shadle algorithm with and without the time-warping front end. The input signal, shown in the top panel, is a synthetic impulse train having a nominal 8-ms period (125 Hz) and uniformly jittered by ±0.5 ms.

Figure 2. Comparison of the conventional noise-separation algorithm against the same algorithm operating with the time-warp front end. The input signal is a synthetic impulse train having nominal pitch of 125 Hz with ±0.5 ms of uniformly-distributed jitter.

The results of this simulation show a substantial improvement when using the time-warping system. First, we observe that even the small amount of jitter present introduces a large leakage of phonation into the noise component of the conventional system. This leakage is removed by the time-warping front end.
Additionally, the conventional system creates multiple erroneous pulses in the resynthesized voiced component, which are not present in the new system.

When the effects of the glottal shape, vocal tract, and additive noise are modeled, the time-warping framework continues to show an advantage. In Figure 3, the input signal, shown in the top panel, is a synthetic vowel /a/ having a nominal 8-ms period (125 Hz) and uniformly jittered by ±0.5 ms, combined with a speech-shaped noise component at a signal-to-noise ratio of 12 dB. We observe that even the small amount of jitter present introduces a large leakage of phonation into the noise component of the conventional system. This leakage is significantly reduced by the time-warping front end.

Figure 3. Comparison of the conventional noise-separation algorithm against the same algorithm operating with the time-warp front end. The input signal is a synthetic vowel /a/ having nominal pitch of 125 Hz with ±0.5 ms of jitter applied. Speech-shaped noise is added to the voiced part at a signal-to-noise ratio of 12 dB.

Quantifying Performance: Each algorithm produces two types of leakage. In the first type, energy that corresponds to voicing appears in the estimate of the turbulent-noise component. In the second type, energy that corresponds to noise appears in the voicing component. Previous work in the literature uses error metrics that combine these two effects; an example of such a metric is the signal-to-error ratio (SER) in [1]. In contrast, we quantify each leakage separately as:

    L_Voiced = (x^T ŷ) / (x^T ŷ + x^T x̂)

    L_Noise = (y^T x̂) / (y^T ŷ + y^T x̂),

where x is the known voiced component, x̂ is the estimated voiced component, y is the known noise component, and ŷ is the estimated noise. x and y are treated as orthogonal bases against which we project the two estimated signals, allowing us to determine their individual contributions.
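The two ratios can be computed directly from the signals; a small Python helper (our own, for illustration):

```python
import numpy as np

def leakage_metrics(x, x_hat, y, y_hat):
    """Voiced- and noise-leakage ratios, following the definitions
    above. x, y are the known voiced and noise components; x_hat,
    y_hat their estimates. Each ratio is 0 for perfect separation
    and 1 when an entire component lands in the wrong estimate."""
    x, x_hat, y, y_hat = map(np.asarray, (x, x_hat, y, y_hat))
    l_voiced = (x @ y_hat) / (x @ y_hat + x @ x_hat)
    l_noise = (y @ x_hat) / (y @ y_hat + y @ x_hat)
    return l_voiced, l_noise
```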
We can show that L_Voiced is approximately the ratio of the amount of voiced energy leaked into the turbulent-noise estimate to the total voiced energy. Likewise, L_Noise is approximately the ratio of the amount of noise energy leaked into the voiced estimate to the total noise energy. Optimally, both metrics will be 0, indicating no leakage of a signal into the incorrect estimate. In the worst case, the leakage ratios will be equal to 1, indicating that an entire component is erroneously present in the incorrect estimate.

Figure 4 shows the results of sweeping the extent of jitter on leakage. The input signal is 3 seconds of the kind of signal shown in Figure 3, consisting of a synthetic vowel /a/ combined with a speech-shaped noise component at a signal-to-noise ratio of 12 dB.

Figure 4. Comparison of energy leakage with the conventional noise-separation algorithm against the same algorithm operating with the time-warp front end. The input signal is a synthetic vowel /a/ having nominal pitch of 125 Hz with the specified amount of uniformly-distributed jitter. Speech-shaped noise is added to the voiced part at a signal-to-noise ratio of 12 dB.

As can be observed in the figure, jitter poses a significant problem for the conventional noise-separation algorithm in terms of the amount of leaked voicing energy. The time-warping front end largely corrects this issue, with only a small amount of the voiced component being seen in the noise-component estimate. For noise-component leakage, we do not expect to do better than the conventional approach with time-warping, as reflected in the figure. Noise-component leakage for both approaches remains at about 50% over all jitter extents. This is a fundamental property of a time-warping front end: we can sharpen spectral harmonics due to voicing, but similar spectral sharpening cannot be accomplished by time-warping noise signals.
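For reference, the kind of jittered excitation used in these sweeps can be generated as in the Python sketch below (our own assumptions for sampling rate and random seed; the vowel cases in the paper additionally pass the excitation through glottal-shape and vocal-tract models):

```python
import numpy as np

def jittered_impulse_train(fs=16000, duration=3.0, period=0.008,
                           jitter=0.0005, seed=0):
    """Impulse train with a nominal 8-ms period (125 Hz) and
    uniformly distributed timing jitter of +/- `jitter` seconds,
    as used for the synthetic experiments."""
    rng = np.random.default_rng(seed)
    x = np.zeros(int(duration * fs))
    t = 0.0
    while t < duration - period:
        t_jittered = t + rng.uniform(-jitter, jitter)
        idx = int(round(t_jittered * fs))
        if 0 <= idx < len(x):
            x[idx] = 1.0                 # place one glottal pulse
        t += period                      # advance by the nominal period
    return x
```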
4.2. Natural speech

In addition to experiments with synthetic irregular phonation, we have also performed tests using natural speech. In this case, we do not know ground truth for the voiced and noise components. Figure 5 compares the results using the Jackson-Shadle algorithm with and without the time-warping front end. The input signal, shown in the top panel, is a section of irregular pitch periods for a male speaker.

Figure 5. Comparison of the conventional noise-separation algorithm against the same algorithm operating with the time-warp front end. The input signal is a segment of natural speech produced by a male speaker.

As with the synthetic example, the results of this experiment show a substantial improvement when using the time-warping system. Once again, we observe large leakage of phonation into the noise component of the conventional system.
This leakage is largely corrected by the time-warping front end, and the correction is especially important for the irregular-phonation segment in the vicinity of ~3.34-3.40 seconds, where the conventional algorithm leaks significant voiced component into the estimated noise component.

5. CONCLUSION AND FUTURE WORK

In this paper, we have provided evidence that a time-warping framework can be used to improve the performance of algorithms that rely on assumptions of local periodicity. As a proof of concept, we have focused on a turbulent-noise decomposition algorithm by Jackson and Shadle, although other separation algorithms are applicable, such as those reviewed in Section 2. With either ground truth or a high-quality estimate of the times of glottal excitation, our technique was shown to reduce the leakage of voiced energy into the noise-component estimate. We also demonstrated the property whereby the leakage of the noise energy into the voiced component remains approximately the same whether or not time-warping is used.

These results have application to the estimation of the aspiration component of speech, which is associated with the percept of breathiness or whisper. Future work will incorporate estimates of the times of glottal excitation from the acoustic signal, rather than relying on nonacoustic sensors such as EGG. We will also study the effect of errors in excitation-time estimation on the performance gains shown in this paper.

REFERENCES

[1] P. J. B. Jackson and C. H. Shadle, "Pitch-scaled estimation of simultaneous voiced and turbulence-noise components in speech," IEEE Trans. Speech Audio Processing, vol. 9, pp. 713-726, 2001.

[2] X. Serra and J. Smith, "Spectral modeling synthesis: A sound analysis/synthesis system based on deterministic plus stochastic decomposition," Comp. Mus. J., vol. 14, no. 4, pp. 12-24, 1990.

[3] C. R. d'Alessandro, B. Yegnanarayana, and V. Darsinos, "Decomposition of speech signals into deterministic and stochastic components," in Proc. ICASSP, pp. 760-763, 1995.

[4] T. Wang and V. Cuperman, "Robust voicing estimation with dynamic time warping," in Proc. ICASSP, pp. 533-536, 1998.

[5] M. A. Ramalho and R. J. Mammone, "New speech enhancement techniques using the pitch mode modulation model," in Proc. 36th Midwest Symposium on Circuits and Systems, pp. 1531-1534, 1993.

[6] P. J. Murphy, "Perturbation-free measurement of the harmonics-to-noise ratio in voice signals using pitch synchronous harmonic analysis," J. Acoust. Soc. Am., vol. 105, pp. 2866-2881, 1999.

[7] M. Brookes, Voicebox: A Speech Processing Toolbox for MATLAB, 2006. [Online]. Available: http://www.ee.imperial.ac.uk/hp/staff/dmb/voicebox/voicebox.html

[8] M. R. P. Thomas and P. A. Naylor, "The SIGMA algorithm: A glottal activity detector for electroglottographic signals," IEEE Trans. Audio, Speech, Language Processing, vol. 17, pp. 1557-1566, 2009.