A TIME-WARPING FRAMEWORK FOR SPEECH TURBULENCE-NOISE COMPONENT ESTIMATION DURING APERIODIC PHONATION*
Nicolas Malyska and Thomas F. Quatieri
MIT Lincoln Laboratory
{nmalyska, quatieri}@ll.mit.edu

ABSTRACT

The accurate estimation of turbulence noise affects many areas of speech processing including separate modification of the noise component, analysis of degree of speech aspiration for treating pathological voice, the automatic labeling of speech voicing, as well as speaker characterization and recognition. Previous work in the literature has provided methods by which such a high-quality noise component may be estimated in near-periodic speech, but it is known that these methods tend to leak aperiodic phonation (with even slight deviations from periodicity) into the noise-component estimate. In this paper, we improve upon existing algorithms in conditions of aperiodicity by introducing a time-warping based approach to speech noise-component estimation, demonstrating the results on both natural and synthetic speech examples.

Index Terms— Time warping, phonation, aspiration, irregularity, glottal closure

*This work is sponsored by the United States Air Force Research Laboratory under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

978-1-4577-0539-7/11/$26.00 ©2011 IEEE, ICASSP 2011

1. INTRODUCTION

The process of estimating the speech turbulence-noise component is important for many applications, including separate modification of the noise component [1], analysis of the degree of speech aspiration for treating pathological voice, the automatic labeling of speech voicing, and speaker characterization and recognition. A number of algorithms exist for estimation of the turbulence-noise component, such as those by Jackson and Shadle [1], Serra and Smith [2], and d'Alessandro et al. [3].
Although these algorithms perform well in periodic regions, issues arise during aperiodic regions of speech. Deviations from the assumption of periodicity lead to leakage of the phonation component of speech into the noise component. In this paper, we improve upon existing algorithms by introducing a time-warping-based approach to speech noise-component estimation. We describe this process in two parts: (1) a time-warping framework for making aperiodic phonation appear periodic, where we use the Jackson and Shadle algorithm as an example, and (2) experiments demonstrating the effectiveness of the turbulence noise-component separation over conventional approaches with both natural and synthetic data. With the new time-warping framework, we show that significant gains can be achieved in terms of preventing the leakage of irregular pitch periods into the noise component. This work acts as a proof of concept for more sophisticated noise-separation algorithms using time-warping as a front end.

We note that in different contexts, time-warping has been applied as a preprocessor for speech analysis. Wang and Cuperman [4], for example, have described a successful method using dynamic time warping for voicing estimation in a speech-coding application. Another case is speech enhancement using a nonlinear time-warping preprocessor, as reported by Ramalho and Mammone [5]. Additionally, Murphy describes a "perturbation-free" measure of harmonic-to-noise ratio in voice, which averages many time-warped glottal cycles in order to estimate the component not due to noise [6]. Our work builds upon this previous research in that it (1) specifically focuses on aperiodic phonation, where the time-warping function varies rapidly, (2) exploits explicit estimates of the glottal-pulse times to develop the time-warping function, and (3) is applied to the problem of estimating the turbulent-noise component of speech.

2. CONVENTIONAL NOISE SEPARATION

A number of algorithms exist for estimation of the turbulence-noise component. One popular approach, by Jackson and Shadle [1], is based on a pitch-dependent short-time Fourier transform representation. The Jackson-Shadle noise-separation algorithm is referred to as pitch-scaled harmonic filtering (PSHF) [1]. The PSHF method is desirable since there is evidence that it can preserve the temporal-modulation characteristics of the noise component, i.e., noise excitations are typically modulated by the periodic glottal flow wave. The PSHF approach uses an analysis window duration equal to four pitch periods and relies on the property that harmonics of the fundamental frequency fall at specific frequency bins of the short-time Fourier transform, specifically with every fourth discrete Fourier transform (DFT) coefficient being selected [1]. The residual values remaining after DFT selection provide one noise-component spectrum, while a second noise estimate can be obtained by interpolation of the residual across the zeroed DFT bins.
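To make the bin-selection step concrete, the following is a minimal Python sketch of the idea (our own illustration, not the authors' MATLAB implementation). It assumes the analysis frame holds exactly four pitch periods, so that harmonics of the fundamental fall on every fourth DFT bin:

```python
import numpy as np

def pshf_decompose(frame, periods_per_window=4):
    """Split one analysis frame into voiced and noise components by
    selecting the DFT bins where the harmonics fall.

    Assumes len(frame) spans exactly `periods_per_window` pitch
    periods, so harmonics land on bins that are multiples of
    `periods_per_window`."""
    spectrum = np.fft.rfft(frame)
    harmonic = np.zeros_like(spectrum)
    bins = np.arange(0, len(spectrum), periods_per_window)
    harmonic[bins] = spectrum[bins]      # harmonic (voiced) bins
    residual = spectrum - harmonic       # residual = one noise estimate
    voiced = np.fft.irfft(harmonic, n=len(frame))
    noise = np.fft.irfft(residual, n=len(frame))
    return voiced, noise
```

The second, interpolated noise estimate described above would additionally fill the zeroed harmonic bins by interpolating the residual across them; that refinement is omitted from this sketch.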
Although our focus here is the Jackson-Shadle algorithm, the framework described in the following section allows application of other decomposition algorithms. For example, d'Alessandro et al. [3] have proposed a decomposition method that incorporates a cepstral lifter (analogous to a spectral comb filter) to iteratively separate the harmonic and noise components, relying on the DFT coefficients to contain a contribution from both a periodic component and a noise component. As an alternative, an earlier sinewave-based decomposition algorithm was proposed by Serra and Smith [2]. An important difference from that of d'Alessandro et al.
is that the Serra and Smith algorithm does not restrict the deterministic part of the signal to contain harmonically-related frequency components, and hence does not require periodicity, reflecting an interest in analyzing non-harmonic signals such as music as well as speech. Nevertheless, due to the need for sinewave estimation from peak-picking, the algorithm can still benefit from improved stationarity.

3. TIME-WARPING FRAMEWORK FOR NOISE-COMPONENT SEPARATION

This section describes the framework that is used to make aperiodic phonation appear periodic prior to performing turbulence-noise decomposition. Although the decomposition algorithm of focus here is the Jackson and Shadle noise-component separation algorithm [1], the framework is applicable to any decomposition algorithm that requires a short-time stationarity constraint. As depicted in Figure 1, the general framework consists of a time-warping signal preprocessor, an analysis stage, and a preprocessor-compensation stage.

Figure 1. Schematic of noise-component separation performed using the time-warping preprocessor and associated compensation.

The first step in the process is to (optionally) whiten the acoustic signal in order to remove the effects of poles in the vocal-tract response as much as possible. This step is used since time-warping tends to distort the formant frequencies and bandwidths in a signal. In this work, we use an order-12 autocorrelation-linear-prediction-based inverse filter, with analysis and synthesis implemented using the Voicebox MATLAB toolbox [7]. The frame interval and analysis Hamming window for the analysis-synthesis are 10 and 20 ms, respectively.

After whitening the input signal, time-warping is applied in order to transform the irregular glottal pulses into regular glottal pulses. This is implemented using a discrete-time simulation (oversampling by a large factor) of continuous-time warping. There are many valid warping functions that will accomplish this transformation. We use a smoothly-interpolated contour, which warps the irregular glottal pulses into periodic pulses.
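As a simplified illustration of this warping step (our Python sketch, not the paper's implementation: the paper uses a heavily oversampled discrete-time simulation and a smooth contour, while this version substitutes a piecewise-linear map and plain linear interpolation, and assumes the pulse times are already known):

```python
import numpy as np

def warp_to_constant_pitch(x, fs, pulse_times, period=0.008):
    """Resample x so the irregular glottal-pulse instants in
    `pulse_times` (seconds) land on a uniform grid spaced `period`
    seconds apart (8 ms, i.e., 125 Hz).

    A piecewise-linear map sends each output time to its pre-image
    on the original (irregular) time axis; the warped signal is then
    read off by linear interpolation."""
    t_irr = np.asarray(pulse_times, dtype=float)
    t_reg = t_irr[0] + period * np.arange(len(t_irr))
    n_out = int(round((t_reg[-1] - t_reg[0]) * fs))
    t_out = t_reg[0] + np.arange(n_out) / fs    # uniform output times
    t_src = np.interp(t_out, t_reg, t_irr)      # pre-image under the warp
    t_in = np.arange(len(x)) / fs               # original sample times
    return np.interp(t_src, t_in, x)
```

Unwarping in the compensation stage applies the inverse map, i.e., the same construction with the roles of the regular and irregular pulse-time grids exchanged.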
The pulse times used to derive this contour are obtained from either known closure times in synthetic speech or EGG-derived glottal closure times in natural speech. Closure times are extracted from the EGG signal using the SIGMA algorithm [8], implemented in the Voicebox MATLAB toolbox [7]. In the experiments reported in this paper, all periods are warped to reflect a constant pitch of 125 Hz.

Once time-warping has been performed, we apply noise separation. As mentioned, noise separation should be a scheme that can benefit from periodicity, but more generally from the invoking of stationarity. We implemented the Jackson-Shadle algorithm in MATLAB to separate the periodic and noise components of speech signals. An important feature of our framework is that it requires no changes to the implementation of the conventional separation system, as is the case for all three of the algorithms described above. In the preprocessor-compensation stage following analysis, the signals that are output by the analysis stage must be unwarped in time so that they align with the input signals. Finally, to undo the whitening that has been applied to the input acoustic signal, spectral shaping is reapplied to each component.

4. EXPERIMENTS

In this section, we perform turbulent-noise decomposition on synthetic and real speech as a demonstration of the capabilities of the new framework. In particular, we demonstrate that the framework can improve separation of the contribution of irregular pitch periods from the contribution of voice aspiration noise.
In both the time-warping and no-time-warping cases, the separation algorithm is the Jackson and Shadle algorithm [1], implemented without interpolation of the power spectrum. The only difference between the two approaches is the presence of the time-warping (and its companion time-warping-compensation) stage. Spectral whitening and whitening-compensation are applied in both scenarios.

4.1. Synthetic speech

In this section, we show how conventional separation allows major leakage, which is significantly reduced with the new algorithm. For the synthetic cases in this section, we quantify performance by using a signal with irregular pulse spacing and with a known amount of additive noise energy.

Two Case Studies: To focus on the effect of time-warping, we first remove the whitening stage and work with a series of synthetic impulse trains. Furthermore, to isolate the effect of jitter and its role in leaking the voiced component, we do not include noise addition in this initial experiment. Figure 2 compares the results using the Jackson and Shadle algorithm with and without the time-warping front end. The input signal, shown in the top panel, is a synthetic impulse train having a nominal 8-ms period (125 Hz) and uniformly jittered by ±0.5 ms.

Figure 2. Comparison of the conventional noise-separation algorithm against the same algorithm operating with the time-warp front end. The input signal is a synthetic impulse train having nominal pitch of 125 Hz with ±0.5 ms of uniformly-distributed jitter.

The results of this simulation show a substantial improvement when using the time-warping system. First, we observe that even the small amount of jitter present introduces a large leakage of phonation into the noise component of the conventional system. This leakage is removed by the time-warping front end.
Additionally, the conventional system creates multiple erroneous pulses in the resynthesized voiced component, which are not present in the new system.

When the effects of the glottal shape, vocal tract, and additive noise are modeled, the time-warping framework continues to show an advantage. In Figure 3, the input signal, shown in the top panel, is a synthetic vowel /a/ having a nominal 8-ms period (125 Hz) and uniformly jittered by ±0.5 ms, combined with a speech-shaped noise component at a signal-to-noise ratio of 12 dB. We observe that even the small amount of jitter present introduces a large leakage of phonation into the noise component of the conventional system. This leakage is significantly reduced by the time-warping front end.

Figure 3. Comparison of the conventional noise-separation algorithm against the same algorithm operating with the time-warp front end. The input signal is a synthetic vowel /a/ having nominal pitch of 125 Hz with ±0.5 ms of jitter applied. Speech-shaped noise is added to the voiced part at a signal-to-noise ratio of 12 dB.

Quantifying Performance: Each algorithm produces two types of leakage. In the first type, energy that corresponds to voicing appears in the estimate of the turbulent-noise component. In the second type, energy that corresponds to noise appears in the voicing component. Previous work in the literature uses error metrics that combine these two effects; an example of such a metric is the signal-to-error ratio (SER) in [1]. In contrast, we quantify each leakage separately as:

    L_Voiced = (x^T ŷ) / (x^T ŷ + x^T x̂)

    L_Noise = (y^T x̂) / (y^T ŷ + y^T x̂),

where x is the known voiced component, x̂ is the estimated voiced component, y is the known noise component, and ŷ is the estimated noise. x and y are treated as orthogonal bases against which we project the two estimated signals, allowing us to determine their individual contributions.
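The two ratios can be computed directly from the signals; a small Python helper (our own, for illustration):

```python
import numpy as np

def leakage_metrics(x, x_hat, y, y_hat):
    """Voiced- and noise-leakage ratios, following the definitions
    above. x, y are the known voiced and noise components; x_hat,
    y_hat their estimates. Each ratio is 0 for perfect separation
    and 1 when an entire component lands in the wrong estimate."""
    x, x_hat, y, y_hat = map(np.asarray, (x, x_hat, y, y_hat))
    l_voiced = (x @ y_hat) / (x @ y_hat + x @ x_hat)
    l_noise = (y @ x_hat) / (y @ y_hat + y @ x_hat)
    return l_voiced, l_noise
```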
We can show that L_Voiced is approximately the ratio of the amount of voiced energy leaked into the turbulent-noise estimate to the total voiced energy. Likewise, L_Noise is approximately the ratio of the amount of noise energy leaked into the voiced estimate to the total noise energy. Optimally, both metrics will be 0, indicating no leakage of a signal into the incorrect estimate. In the worst case, the leakage ratios will be equal to 1, indicating that an entire component is erroneously present in the incorrect estimate.

Figure 4 shows the results of sweeping the extent of jitter on leakage. The input signal is 3 seconds of the kind of signal shown in Figure 3, consisting of a synthetic vowel /a/ combined with a speech-shaped noise component at a signal-to-noise ratio of 12 dB.

Figure 4. Comparison of energy leakage with the conventional noise-separation algorithm against the same algorithm operating with the time-warp front end. The input signal is a synthetic vowel /a/ having nominal pitch of 125 Hz with the specified amount of uniformly-distributed jitter. Speech-shaped noise is added to the voiced part at a signal-to-noise ratio of 12 dB.

As can be observed in the figure, jitter poses a significant problem for the conventional noise-separation algorithm in terms of the amount of leaked voicing energy. The time-warping front end largely corrects this issue, with only a small amount of the voiced component being seen in the noise-component estimate. For noise-component leakage, we do not expect to do better than the conventional approach with time-warping, as reflected in the figure. Noise-component leakage for both approaches remains at about 50% over all jitter extents. This is a fundamental property of a time-warping front end: we can sharpen spectral harmonics due to voicing, but similar spectral sharpening cannot be accomplished by time-warping noise signals.
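For reference, the kind of jittered excitation used in these sweeps can be generated as in the Python sketch below (our own assumptions for sampling rate and random seed; the vowel cases in the paper additionally pass the excitation through glottal-shape and vocal-tract models):

```python
import numpy as np

def jittered_impulse_train(fs=16000, duration=3.0, period=0.008,
                           jitter=0.0005, seed=0):
    """Impulse train with a nominal 8-ms period (125 Hz) and
    uniformly distributed timing jitter of +/- `jitter` seconds,
    as used for the synthetic experiments."""
    rng = np.random.default_rng(seed)
    x = np.zeros(int(duration * fs))
    t = 0.0
    while t < duration - period:
        t_jittered = t + rng.uniform(-jitter, jitter)
        idx = int(round(t_jittered * fs))
        if 0 <= idx < len(x):
            x[idx] = 1.0                 # place one glottal pulse
        t += period                      # advance by the nominal period
    return x
```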
4.2. Natural speech

In addition to experiments with synthetic irregular phonation, we have also performed tests using natural speech. In this case, we do not know ground truth for the voiced and noise components. Figure 5 compares the results using the Jackson-Shadle algorithm with and without the time-warping front end. The input signal, shown in the top panel, is a section of irregular pitch periods for a male speaker.

Figure 5. Comparison of the conventional noise-separation algorithm against the same algorithm operating with the time-warp front end. The input signal is a segment of natural speech produced by a male speaker.

As with the synthetic example, the results of this experiment show a substantial improvement when using the time-warping system. Once again, we observe large leakage of phonation into the noise component of the conventional system.
This leakage is largely corrected by the time-warping front end, and the correction is especially important for the irregular-phonation segment in the vicinity of ~3.34-3.40 seconds, where the conventional algorithm leaks significant voiced component into the estimated noise component.

5. CONCLUSION AND FUTURE WORK

In this paper, we have provided evidence that a time-warping framework can be used to improve the performance of algorithms that rely on assumptions of local periodicity. As a proof of concept, we have focused on a turbulent-noise decomposition algorithm by Jackson and Shadle, although other separation algorithms are applicable, such as those reviewed in Section 2. With either ground truth or a high-quality estimate of the times of glottal excitation, our technique was shown to reduce the leakage of voiced energy into the noise-component estimate. We also demonstrated the property whereby the leakage of the noise energy into the voiced component remains approximately the same whether or not time-warping is used.

These results have application to the estimation of the aspiration component of speech, which is associated with the percept of breathiness or whisper. Future work will incorporate estimates of the times of glottal excitation from the acoustic signal, rather than relying on nonacoustic sensors such as EGG. We will also study the effect of errors in excitation-time estimation on the performance gains shown in this paper.

REFERENCES

[1] P. J. B. Jackson and C. H. Shadle, "Pitch-scaled estimation of simultaneous voiced and turbulence-noise components in speech," IEEE Trans. Speech Audio Processing, vol. 9, pp. 713-726, 2001.

[2] X. Serra and J. Smith, "Spectral modeling synthesis: A sound analysis/synthesis system based on deterministic plus stochastic decomposition," Comp. Mus. J., vol. 14, no. 4, pp. 12-24, 1990.

[3] C. R. d'Alessandro, B. Yegnanarayana, and V. Darsinos, "Decomposition of speech signals into deterministic and stochastic components," in Proc. ICASSP, pp. 760-763, 1995.

[4] T. Wang and V. Cuperman, "Robust voicing estimation with dynamic time warping," in Proc. ICASSP, pp. 533-536, 1998.

[5] M. A. Ramalho and R. J. Mammone, "New speech enhancement techniques using the pitch mode modulation model," in Proc. 36th Midwest Symposium on Circuits and Systems, pp. 1531-1534, 1993.

[6] P. J. Murphy, "Perturbation-free measurement of the harmonics-to-noise ratio in voice signals using pitch synchronous harmonic analysis," J. Acoust. Soc. Am., vol. 105, pp. 2866-2881, 1999.

[7] M. Brookes, Voicebox: A Speech Processing Toolbox for MATLAB, 2006. [Online]. Available: http://www.ee.imperial.ac.uk/hp/staff/dmb/voicebox/voicebox.html

[8] M. R. P. Thomas and P. A. Naylor, "The SIGMA algorithm: A glottal activity detector for electroglottographic signals," IEEE Trans. Audio, Speech, Language Processing, vol. 17, pp. 1557-1566, 2009.