A Novel Magnitude-Squared Spectrum Cost Function for Speech Enhancement

Huan Zhao, Zhiqiang Lu
School of Information Science and Engineering, Hunan University, Lushan South Road, Changsha 410082, P. R. China, phone: +86 731 88821563, e-mail: hzhao@hun.edu.cn
Fei Yu
Jiangsu Provincial Key Lab of Image Processing & Image Communications, Nanjing University of Posts and Telecommunications, Nanjing 310002, P. R. China, phone: +86 18229764458, e-mail: hunanyufei@126.com
Cheng Xu
School of Information Science and Engineering, Hunan University, Lushan South Road, Changsha 410082, P. R. China; Jiangsu Provincial Key Lab of Image Processing & Image Communications, Nanjing University of Posts and Telecommunications, Nanjing 310002, P. R. China


Introduction
The problem of improving the quality and intelligibility of speech in noisy environments has attracted a great deal of interest for a long time. Noise is inevitable in real-world applications of speech processing. In particular, speech coders and speech recognition systems may be rendered useless in the presence of background noise.
Numerous techniques have been developed. Conventional speech enhancement algorithms fall into four classes: spectral subtraction [1], subspace [2], statistical-model-based [3] and Wiener-filter-based algorithms [4]. The well-known Ephraim-Malah algorithm, which is based on a statistical model, is an MMSE estimator [3] for the speech DFT amplitude. In this study, we also choose the Bayes risk as the basis, since it is the most fundamental statistical model approach and many algorithms are closely connected to it [5]. Minimizing the Bayes risk for a given cost function results in a variety of estimators. In fact, the maximum a posteriori (MAP) [6], minimum mean square error (MMSE) and maximum likelihood (ML) [7] estimators can all be derived from different Bayes risk cost functions. Bayesian estimators based on perceptually motivated cost functions, used in place of the traditional cost functions, are likewise tightly related to the Bayes risk [8][9][10]. In summary, different Bayesian estimators can be derived depending on the choice of the cost function.
Recently, Lu and Loizou [11] proposed a speech enhancement algorithm which assumes that the magnitude-squared spectrum of the noisy speech signal can be computed as the sum of the clean signal and noise magnitude-squared spectra, and derived an MMSE-MSS estimator which uses the squared-error cost function. Motivated by this assumption, we derive a novel speech enhancement estimator using another distortion measure. Results show that the proposed estimator yields lower residual noise and lower speech distortion than the conventional MMSE-MSS estimator, and thus better speech quality.
This paper is organized as follows. Section 2 gives background information on the Bayes risk. Section 3 presents the proposed algorithm. Section 4 presents experimental results comparing the proposed algorithm with other algorithms. Finally, the last section summarizes our work.

Bayes risk
Suppose the observed noisy speech is y(n) = x(n) + d(n), where the clean speech signal x(n) is perturbed by a statistically independent additive noise signal d(n). The short-time Fourier transform of y(n) can be expressed as $Y(\omega_k) = X(\omega_k) + D(\omega_k)$. For $\omega_k = 2\pi k/N$ and $k = 0, 1, 2, \ldots, N-1$, this is equivalent to the polar form
$$Y_k \exp(j\theta_y(k)) = X_k \exp(j\theta_x(k)) + D_k \exp(j\theta_d(k)),$$
where $\theta_y(k)$, $\theta_x(k)$ and $\theta_d(k)$ denote the phases and $Y_k$, $X_k$, $D_k$ are the spectral magnitudes at frequency bin k of the noisy speech, clean signal, and noise, respectively. The short-time spectral magnitude estimate of $X_k$ can be expressed in the form $\hat{X}_k = G(k) \cdot Y_k$, where the gain function $G(k)$ is a function of $\xi_k = \sigma_x^2(k)/\sigma_d^2(k)$ and $\gamma_k = Y_k^2/\sigma_d^2(k)$, which are interpreted as the a priori and a posteriori SNRs, respectively.
Here $\sigma_x^2(k)$ and $\sigma_d^2(k)$ denote the clean speech and noise variances, respectively.
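As a minimal sketch of the quantities above, the following Python snippet computes the a priori and a posteriori SNRs for one frequency bin and applies a gain to the noisy magnitude while keeping the noisy phase. It assumes the standard definitions ξ = σ_x²/σ_d² and γ = Y²/σ_d²; the function names, the input values and the Wiener-type placeholder gain ξ/(1+ξ) are illustrative assumptions, not part of the derivation in this paper.

```python
import cmath


def snrs(noisy_mag, speech_var, noise_var):
    """Return (xi, gamma): a priori and a posteriori SNRs for one bin."""
    xi = speech_var / noise_var          # assumed definition: sigma_x^2 / sigma_d^2
    gamma = noisy_mag ** 2 / noise_var   # assumed definition: Y_k^2 / sigma_d^2
    return xi, gamma


def apply_gain(noisy_bin, gain):
    """Scale the magnitude of a complex STFT bin and keep its phase."""
    mag, phase = cmath.polar(noisy_bin)
    return cmath.rect(gain * mag, phase)


# Hypothetical values for a single bin (Y_k = 5).
noisy_bin = complex(3.0, 4.0)
xi, gamma = snrs(abs(noisy_bin), speech_var=2.0, noise_var=1.0)
placeholder_gain = xi / (1.0 + xi)       # illustrative gain only
enhanced = apply_gain(noisy_bin, placeholder_gain)
```

Any gain function G(k) of ξ and γ can be substituted for the placeholder; only the magnitude is modified, which matches the focus on magnitude estimation in this section.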
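Note that, for perceived speech quality, the spectral magnitude is more important than the phase. We therefore focus on estimating the spectral magnitude $X_k$ from the noisy spectral magnitude $Y_k$. Let
$$\varepsilon = X_k - \hat{X}_k, \qquad (1)$$
where $\varepsilon$ denotes the error in estimating the magnitude $X_k$ at frequency bin k. The well-known Bayes risk is given by
$$R_B = E[d(X_k, \hat{X}_k)] = \iint d(X_k, \hat{X}_k)\, p(X_k, Y_k)\, dX_k\, dY_k = \int \left[ \int d(X_k, \hat{X}_k)\, p(X_k \mid Y_k)\, dX_k \right] p(Y_k)\, dY_k, \qquad (2)$$
where $d(X_k, \hat{X}_k)$ is the cost function, a function of the error $\varepsilon$, and $E(\cdot)$, $p(\cdot)$, $p(\cdot \mid \cdot)$ denote the expectation, the probability density function and the conditional probability density function, respectively. Different distortion measures lead to different estimators, which is what makes the choice of cost function of interest.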
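To illustrate how the choice of distortion measure changes the resulting estimator, the sketch below minimizes a discrete version of the conditional cost over a grid of candidate estimates. The posterior values, probabilities and function names are hypothetical and chosen only for illustration; the point is that the squared-error cost is minimized by the conditional mean, while the absolute-error cost is minimized by the conditional median.

```python
def conditional_cost(estimate, values, probs, distortion):
    """Discrete analogue of the inner integral: sum of d(x, estimate) * p(x | y)."""
    return sum(distortion(x, estimate) * p for x, p in zip(values, probs))


def best_estimate(values, probs, distortion, candidates):
    """Return the candidate with the smallest conditional cost."""
    return min(candidates,
               key=lambda c: conditional_cost(c, values, probs, distortion))


# Hypothetical skewed posterior p(x | y) over four values.
values = [0.0, 1.0, 2.0, 10.0]
probs = [0.1, 0.5, 0.3, 0.1]
candidates = [i / 100 for i in range(0, 1001)]  # grid 0.00 .. 10.00

squared_opt = best_estimate(values, probs, lambda x, e: (x - e) ** 2, candidates)
absolute_opt = best_estimate(values, probs, lambda x, e: abs(x - e), candidates)

mean = sum(x * p for x, p in zip(values, probs))  # 2.1 for this posterior
```

For this skewed posterior the two minimizers differ noticeably: the mean is pulled towards the outlying value, whereas the median stays at the point where the cumulative probability crosses one half.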

Proposed algorithm
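Minimizing $R_B$ is equivalent to minimizing
$$R = \int d(X_k, \hat{X}_k)\, p(X_k \mid Y_k)\, dX_k, \qquad (3)$$
where $R$ corresponds to the inner integral in (2); we refer to it as the conditional average cost function. A variety of traditional cost functions have been developed. Substituting the squared-error cost function $d(X_k, \hat{X}_k) = (X_k - \hat{X}_k)^2$, taking the derivative of $R$ with respect to $\hat{X}_k$ and setting it equal to zero yields the well-known MMSE estimator [3]. The MAP estimator [6] is obtained with the cost function
$$d(X_k, \hat{X}_k) = \begin{cases} 0, & |X_k - \hat{X}_k| < \delta \\ 1, & \text{otherwise} \end{cases}$$
where $\delta$ denotes a small positive parameter. Note that the ML estimator is a special case of the MAP estimator in which the density of $X_k$ is assumed to be uniform.
1) Minimum Mean Squared-Error Estimator.
The above analysis applies to the magnitude spectrum, not to the magnitude-squared spectrum. However, some cost functions are also appropriate for the magnitude-squared spectrum. To use them, we must derive the corresponding conditional average cost function. We treat $X_k^2$ and $Y_k^2$ each as a single variable and replace (3) with
$$R = \int d(X_k^2, \hat{X}_k^2)\, p(X_k^2 \mid Y_k^2)\, dX_k^2. \qquad (4)$$
From this equation we can derive the MMSE-MSS estimator. Lu and Loizou [11] proposed a solution with the distortion measure
$$d(X_k^2, \hat{X}_k^2) = (X_k^2 - \hat{X}_k^2)^2.$$
Finally, we obtain
$$\hat{X}_k^2 = E[X_k^2 \mid Y_k^2] = \int_0^{Y_k^2} X_k^2\, p(X_k^2 \mid Y_k^2)\, dX_k^2. \qquad (5)$$
Additionally, in closed form [11],
$$\hat{X}_k^2 = \begin{cases} \left( \dfrac{1}{\nu_k} - \dfrac{1}{\exp(\nu_k) - 1} \right) Y_k^2, & \sigma_x^2(k) \neq \sigma_d^2(k) \\ 0.5\, Y_k^2, & \sigma_x^2(k) = \sigma_d^2(k) \end{cases} \qquad (6)$$
where $\nu_k$ is defined as
$$\nu_k = \frac{1 - \xi_k}{\xi_k}\, \gamma_k.$$
2) Conditional Median Estimator.
In this section, we investigate another cost function that relies on the magnitude-squared spectrum assumption. The distortion measure is defined as the magnitude-squared absolute error
$$d(X_k^2, \hat{X}_k^2) = |X_k^2 - \hat{X}_k^2|.$$
Substituting this measure into (4) and setting the derivative with respect to $\hat{X}_k^2$ equal to zero gives
$$\int_0^{\hat{X}_k^2} p(X_k^2 \mid Y_k^2)\, dX_k^2 = \int_{\hat{X}_k^2}^{Y_k^2} p(X_k^2 \mid Y_k^2)\, dX_k^2 = \frac{1}{2}. \qquad (7)$$
The solution of the above equation is the conditional median, which we use to estimate $X_k^2$. Under the magnitude-squared spectrum assumption, the posterior density is [11]
$$p(X_k^2 \mid Y_k^2) = \begin{cases} \psi_k \exp\!\left( -\dfrac{X_k^2}{\lambda(k)} \right), & \sigma_x^2(k) \neq \sigma_d^2(k) \\ \dfrac{1}{Y_k^2}, & \sigma_x^2(k) = \sigma_d^2(k) \end{cases} \qquad 0 \le X_k^2 \le Y_k^2, \qquad (8)$$
$$\psi_k = \frac{1}{\lambda(k) \left[ 1 - \exp\!\left( -Y_k^2/\lambda(k) \right) \right]}, \qquad (9)$$
where $\psi_k$ is a positive parameter. Substituting (8) and (9) into (7), and using $1/\lambda(k) = 1/\sigma_x^2(k) - 1/\sigma_d^2(k)$, we obtain for $\sigma_x^2(k) \neq \sigma_d^2(k)$
$$\hat{X}_k^2 = \lambda(k) \left[ \ln 2 - \ln\!\left( 1 + \exp\!\left( -Y_k^2/\lambda(k) \right) \right) \right]. \qquad (10)$$
Simplifying the above equation using $\lambda(k) = Y_k^2/\nu_k$ gives
$$\hat{X}_k^2 = \begin{cases} \dfrac{1}{\nu_k} \left[ \ln 2 - \ln(1 + \exp(-\nu_k)) \right] Y_k^2, & \sigma_x^2(k) \neq \sigma_d^2(k) \\ 0.5\, Y_k^2, & \sigma_x^2(k) = \sigma_d^2(k) \end{cases} \qquad (11)$$
Fig. 1 compares the gain functions of the proposed estimator and the MMSE-MSS estimator. Comparing (11) with (6), the proposed gain is smaller than the MMSE-MSS gain when $\nu_k > 0$ (that is, $\xi_k < 1$, noise-dominated bins) and larger when $\nu_k < 0$ ($\xi_k > 1$, speech-dominated bins). For this reason, we expect that the CM-MSS estimator can perform better than the MMSE-MSS estimator.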
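The two gain functions can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it assumes ν = (1 − ξ)γ/ξ, the MMSE-MSS closed form G² = 1/ν − 1/(exp(ν) − 1) from [11], and the conditional median closed form G² = [ln 2 − ln(1 + exp(−ν))]/ν, with G² = 0.5 when ν is (numerically) zero. The function names, the threshold 1e-6 and the overflow guard at ν > 700 are choices made here for numerical robustness.

```python
import math


def nu(xi, gamma):
    """nu_k = (1 - xi_k) / xi_k * gamma_k (assumed definition)."""
    return (1.0 - xi) / xi * gamma


def _softplus(t):
    """Numerically stable ln(1 + exp(t))."""
    if t > 0.0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))


def mmse_mss_gain(xi, gamma):
    """Gain X_hat / Y of the MMSE-MSS estimator (assumed closed form)."""
    v = nu(xi, gamma)
    if abs(v) < 1e-6:                      # equal-variance branch
        return math.sqrt(0.5)
    # exp(v) overflows for large v; 1 / (exp(v) - 1) is then negligible.
    inv_expm1 = 0.0 if v > 700.0 else 1.0 / math.expm1(v)
    return math.sqrt(1.0 / v - inv_expm1)


def cm_mss_gain(xi, gamma):
    """Gain X_hat / Y of the conditional median (CM-MSS) estimator."""
    v = nu(xi, gamma)
    if abs(v) < 1e-6:                      # equal-variance branch
        return math.sqrt(0.5)
    return math.sqrt((math.log(2.0) - _softplus(-v)) / v)
```

Under these assumptions, `cm_mss_gain(0.1, 5.0)` is smaller than `mmse_mss_gain(0.1, 5.0)`, while `cm_mss_gain(10.0, 5.0)` is larger than `mmse_mss_gain(10.0, 5.0)`, so the median-based gain attenuates more at low a priori SNR and less at high a priori SNR.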
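To compare the estimators, we need to know the a priori SNR $\xi_k$. We therefore use the "decision-directed" approach [3]:
$$\hat{\xi}_k(l) = \max\left[ \alpha\, \frac{\hat{X}_k^2(l-1)}{\hat{\sigma}_d^2(k, l-1)} + (1 - \alpha) \max\left( \gamma_k(l) - 1,\, 0 \right),\ \xi_{\min} \right], \qquad (12)$$
where l denotes the frame index and $\alpha$ denotes a tunable smoothing coefficient.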
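A minimal sketch of the decision-directed update for a single frequency bin is given below. It assumes the commonly used form of the rule from [3], with the lower bound ξ_min applied as a floor; the defaults α = 0.97 and ξ_min = −20 dB are the values stated in this paper, while the function and argument names are hypothetical.

```python
def decision_directed_xi(prev_clean_power, prev_noise_var, gamma,
                         alpha=0.97, xi_min_db=-20.0):
    """Estimate the a priori SNR for the current frame of one bin.

    prev_clean_power: estimated clean magnitude-squared value, previous frame
    prev_noise_var:   estimated noise variance, previous frame
    gamma:            a posteriori SNR of the current frame
    """
    xi_min = 10.0 ** (xi_min_db / 10.0)    # convert the floor from dB to linear
    xi = (alpha * prev_clean_power / prev_noise_var
          + (1.0 - alpha) * max(gamma - 1.0, 0.0))
    return max(xi, xi_min)
```

For example, with no speech estimated in the previous frame and γ below 1, the estimate falls back to the floor of 0.01 (−20 dB), which prevents the a priori SNR from reaching zero.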

Experiments and results
To evaluate the performance of the proposed and derived estimators, a total of 30 sentences taken from the publicly available NOIZEUS database were used. The sentences were corrupted by car, babble, white and street noise at 0, 5, 10, and 15 dB SNRs. Speech was segmented into 20 ms frames and Hann-windowed with 50% overlap. The overlap-add method was used to obtain the enhanced signal. The noise spectrum was estimated using the minima controlled recursive averaging (MCRA) algorithm [12]. In (12), the value of $\alpha$ was set to 0.97. To assess the performance, two objective measures were used: the average segmental signal-to-noise ratio (SNRseg) and the Perceptual Evaluation of Speech Quality (PESQ) [13]. The PESQ measure, which has been found to correlate highly with speech quality [14], is the best predictor of overall quality, covering both speech quality and noise distortion. Higher PESQ values indicate better performance, i.e., better speech quality. In terms of background noise distortion, SNRseg is the best measure. As with PESQ, higher SNRseg values indicate that the enhanced signal is more similar to the clean speech. The SNRseg measure is defined by
$$\mathrm{SNRseg} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{k=Lm}^{Lm+L-1} x_k^2}{\sum_{k=Lm}^{Lm+L-1} (x_k - \hat{x}_k)^2}, \qquad (13)$$
where $x_k$ and $\hat{x}_k$ denote the clean speech and the estimated speech, respectively, and M and L denote the total number of frames and the frame length, respectively.
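The segmental SNR can be computed with a few lines of Python. The sketch below assumes the usual definition, the mean over frames of 10 log10 of the ratio between clean-signal energy and error energy in each frame, with non-overlapping frames of length L. The small constant `eps`, which avoids division by zero and the logarithm of zero, and the function name are assumptions added here; practical implementations often also limit the per-frame values to a fixed range, which is not done in this sketch.

```python
import math


def snr_seg(clean, enhanced, frame_len, eps=1e-12):
    """Average segmental SNR in dB over non-overlapping frames."""
    num_frames = min(len(clean), len(enhanced)) // frame_len
    if num_frames == 0:
        raise ValueError("signals are shorter than one frame")
    total = 0.0
    for m in range(num_frames):
        start = m * frame_len
        stop = start + frame_len
        signal_energy = sum(x * x for x in clean[start:stop])
        error_energy = sum((x - y) ** 2
                           for x, y in zip(clean[start:stop],
                                           enhanced[start:stop]))
        # eps guards against silent frames and perfect reconstruction
        total += 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))
    return total / num_frames


# Hypothetical example: a constant signal with a 10 % amplitude error.
clean = [1.0] * 8
enhanced = [0.9] * 8
value = snr_seg(clean, enhanced, frame_len=4)   # close to 20 dB
```

Because the ratio is averaged in the log domain frame by frame, low-energy frames contribute as much as loud ones, which is why the measure is sensitive to residual background noise.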
Table 1 and Table 2 show the performance comparison between the estimators in terms of SNRseg and PESQ, respectively. SNRseg is easy to implement and correlates better with the mean opinion score (MOS) than SNR, so it has been widely used to assess enhanced speech. Table 1 shows that, for all four noise types and at all SNR levels, the CM-MSS estimator yields significantly higher SNRseg values than the MMSE-MSS estimator.
PESQ is more reliable and correlates better with MOS than the traditional measures in most situations. The overall PESQ results are shown in Table 2. As with SNRseg, for all noise types and at all SNR levels, the CM-MSS estimator yields significantly higher PESQ scores than the MMSE-MSS estimator. In summary, the proposed estimator offers better speech quality and lower speech distortion than the MMSE-MSS estimator.

Conclusions
In this paper, we reviewed several existing Bayesian short-time spectral amplitude cost functions for speech enhancement. As there are no prior studies of the conditional median, we proposed a new MSS estimator in which the distortion measure is the absolute error function. The derived estimator markedly reduces the background noise without introducing speech distortion and is superior to the MMSE-MSS estimator in terms of both SNRseg and PESQ. Our future work is to derive the conditional median estimator of the magnitude spectrum.

Fig. 1. Gain function of the proposed estimator and the MMSE-MSS estimator.


In (12), $\hat{\sigma}_d^2(k, l-1)$ denotes the estimate of the noise variance and $\xi_{\min} = -20$ dB.

Table 1. Performance comparison, in terms of SNRseg, between the various estimators.

Table 2. Performance comparison, in terms of PESQ, between the various estimators.