Peculiarities of Wiener Class Systems and their Exploitation for Speech Signal Prediction


Introduction
Speech technologies are widely employed in modern telecommunication systems. Received speech signal quality [1], spoken word recognition accuracy [2] and speaker identification correctness [3, 4] strongly rely on speech modelling precision. Even modern audio coders are based on linear models of speech, although speech in general is non-linear and non-stationary. Therefore, the modelling is imperfect.
As the majority of real-world systems are not linear in nature, nonlinear methods are required for their modelling, and the application range of such models is very wide. Therefore, in the research literature of the last decade, Wiener class systems (see Fig. 1), defined as a linear time-invariant, dynamic, causal and stable subsystem followed by a static nonlinearity, have drawn considerable attention. Several parametric and non-parametric identification procedures for Wiener class systems were presented quite recently [5, 6, 7].
In this paper we apply a Wiener class system, consisting of a linear dynamic subsystem coupled with a sigmoid nonlinearity, to speech modelling. Initial experiments are performed on the Lithuanian vowel "ū". Based on the results and observations made, we extend our experiments to other vowels taken from different words. The results are presented, and the performance of the Wiener system model is compared with that of the widely employed Linear Prediction Coding (LPC) model. Based on the properties found, the Wiener system model is proposed for speech prediction. The presented experimental simulations demonstrate an increase of accuracy at selected signal windows, thus supporting the proposed approach. Application of the results in telecommunication systems could possibly decrease data transfer rates and increase speech intelligibility and the naturalness of speech synthesis.

Setup for speech signal prediction
The Wiener system model we use consists of an infinite impulse response dynamic subsystem in the linear part and a sigmoid function in the nonlinear part. The intermediate signal of the Wiener system model is expressed as

$$x(k) = \frac{b_1 q^{-1} + \dots + b_n q^{-n}}{1 + a_1 q^{-1} + \dots + a_n q^{-n}}\, u(k), \qquad (1)$$

here $u(k)$ is the input signal; $b_1, \dots, b_n$ and $a_1, \dots, a_n$ are dynamic system coefficients; $q$ is the time shift operator; $n$ is the order of the dynamic system. The nonlinearity used in the Wiener system model is defined by

$$y(k) = f\big(x(k)\big), \qquad (2)$$

here $y(k)$ is the output signal; $f(\cdot)$ is a monotonic and invertible sigmoid function with two parameters, a gain and a bias. The dependence of the sigmoid function on these parameters is demonstrated in Fig. 2.
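The structure described above (a linear dynamic subsystem followed by a static sigmoid) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the numerator is assumed to start at a one-sample delay, and the sigmoid is assumed to have the form tanh(gain · (x − bias) / 2), chosen because a gain of 2 then gives a unit slope at zero.

```python
import math


def iir_filter(b, a, u):
    """Linear part: x(k) = sum_i b[i]*u(k-1-i) - sum_i a[i]*x(k-1-i).

    b and a hold the numerator and denominator coefficients (index 0 is the
    coefficient of a one-sample delay). Samples before k = 0 are taken as zero.
    """
    x = []
    for k in range(len(u)):
        acc = 0.0
        for i in range(len(b)):
            if k - 1 - i >= 0:
                acc += b[i] * u[k - 1 - i]
        for i in range(len(a)):
            if k - 1 - i >= 0:
                acc -= a[i] * x[k - 1 - i]
        x.append(acc)
    return x


def sigmoid(x, gain=2.0, bias=0.0):
    """Assumed sigmoid form: odd, bounded in (-1, 1), slope gain/2 at x = bias."""
    return math.tanh(gain * (x - bias) / 2.0)


def wiener_output(b, a, u, gain=2.0, bias=0.0):
    """Static nonlinearity applied sample by sample to the intermediate signal."""
    return [sigmoid(x, gain, bias) for x in iir_filter(b, a, u)]
```

For example, `wiener_output([0.5], [-0.5], [1, 0, 0, 0])` passes a unit impulse through a first-order subsystem and then through the sigmoid. Because the nonlinearity is static, it never changes the timing of the intermediate signal, only its amplitude.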
In order to find the parameters of the Wiener system model applied to signal prediction, a fragment of the signal is used to generate the training dataset. The output signal is advanced in time by l samples with respect to the input signal. The identification is an iterative procedure and requires many more calculations. However, it can be implemented in field programmable gate arrays, which can be used to accelerate parameter identification [8].
Speech data. The experiments are performed using 16-bit mono audio recordings, sampled at 11.025 kHz, of the Lithuanian words "kultūra" and "įmonė". The speech is produced by an adult male Lithuanian speaker. The recordings were made in a noise-free environment.
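The time advance between the input and output signals of the training dataset, mentioned above, can be illustrated as follows. This is a hedged sketch: the function name and its arguments are hypothetical, and the value of the advance l is not stated in the text, so it is left as a parameter.

```python
def make_training_pairs(signal, start, length, advance):
    """Return (inputs, targets) where the targets lead the inputs by `advance` samples."""
    if advance < 1:
        raise ValueError("advance must be at least one sample")
    end = start + length + advance
    if start < 0 or end > len(signal):
        raise ValueError("window does not fit inside the signal")
    inputs = signal[start:start + length]
    targets = signal[start + advance:end]
    return inputs, targets
```

Each target sample is the signal value `advance` samples after the corresponding input sample, so a model fitted to these pairs is trained as a predictor rather than as a plain filter.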
Model initialization. An LPC model of 14th order was chosen for comparison purposes. The same order was chosen for the linear dynamic part of the Wiener system model. The coefficients of the linear dynamic system can be initialized randomly; however, here the LPC coefficients were used. As the bias in the signals was removed, the bias of the sigmoid function was initially set to 0. The gain of the sigmoid function was set to 2, so that the tangent at point 0 has a slope of 45 degrees.
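The LPC coefficients used for initialization can be estimated in several ways. The sketch below uses the autocorrelation method with the Levinson-Durbin recursion, which is a common choice but an assumption here, since the text does not say which estimator was used. The function names are hypothetical, and how the resulting coefficients are mapped onto the linear subsystem is not specified in the text.

```python
def autocorrelation(signal, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of a finite signal window."""
    n = len(signal)
    return [sum(signal[k] * signal[k + lag] for k in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(signal, order):
    """Return a_1..a_order of the predictor s_hat(k) = -sum_i a_i * s(k - i).

    Levinson-Durbin recursion on the autocorrelation sequence; the window is
    assumed to be non-zero so that the prediction error never reaches zero.
    """
    r = autocorrelation(signal, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        reflection = -acc / err
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + reflection * a[m - i]
        new_a[m] = reflection
        a = new_a
        err *= 1.0 - reflection * reflection
    return a[1:]
```

With a 256-sample training window stored in a list `frame`, a call such as `lpc_coefficients(frame, 14)` would yield the 14 coefficients corresponding to the model order chosen above.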

Investigation of single vowel prediction
The first group of experiments examines the influence of the training procedure on the error of the future signal. For this reason the MSE values of the training, validation and testing datasets are observed during the training procedure. The length of each dataset is 256 samples. The validation dataset window is shifted by 256 samples and the testing dataset window by 512 samples into the future with respect to the training dataset window (see Fig. 3).
Ten experiments were performed on the vowel "ū" from the word "kultūra". The dataset windows were shifted 10 times, each time by 10 more samples. The achieved prediction results are summarized in Table 1. Moreover, Fig. 4 shows two cases of the model identification process: one in which the Wiener system model is superior on the validation dataset (Experiment 2) and one in which the LPC model is superior on the training dataset (Experiment 9).
It is worth noticing that the smallest errors of the LPC model (Experiments 1 and 9) are never improved upon by the Wiener system model. This peculiarity suggests that the chosen nonlinearity does not suit the vowel "ū" signal well and that no reduction of the MSE on the training dataset is possible, as the linear dynamic subsystem will always have an MSE smaller than or equal to that of the LPC model. To confirm this, a further investigation should be performed on the choice of nonlinearity depending on the initial signal phase in a window.
The results in Table 1 seem to depend strongly on the initial phase of the speech signal to be predicted. Comparing the means and population standard deviations of both models confirms that the Wiener system model demonstrates more consistent results on the testing dataset: it outperforms the LPC model in 8 experiments. The mean MSE ratios (LPC vs Wiener system models) show that the errors of the Wiener system model are smaller by approximately 18 % (training dataset), 17 % (validation dataset) and 14 % (testing dataset).
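The window layout described above (three consecutive 256-sample windows for training, validation and testing) and the MSE measure can be sketched as follows. The function names and the `start` argument are hypothetical and serve only as an illustration.

```python
WINDOW = 256  # dataset length in samples, as stated in the text


def split_windows(signal, start, window=WINDOW):
    """Return (training, validation, testing) windows starting at `start`.

    The validation window begins `window` samples and the testing window
    2 * `window` samples after the start of the training window.
    """
    training = signal[start:start + window]
    validation = signal[start + window:start + 2 * window]
    testing = signal[start + 2 * window:start + 3 * window]
    if start < 0 or len(testing) < window:
        raise ValueError("signal too short for three windows")
    return training, validation, testing


def mse(target, predicted):
    """Mean squared error between two equally long sequences."""
    if len(target) != len(predicted) or not target:
        raise ValueError("sequences must be non-empty and equally long")
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)
```

At the 11.025 kHz sampling frequency of the recordings, one 256-sample window covers roughly 23 ms of speech, so the testing window starts about 46 ms after the training window.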

Investigation of multiple vowel prediction
The second group of experiments investigates whether the use of the sigmoid nonlinearity can give an advantage on other vowels. For this, the prediction errors of the LPC, Infinite Impulse Response (IIR) and Wiener system models are analysed in signal windows shifted into the future with respect to the training dataset window.
The experiment was performed by identifying the parameters of the models from a 256-sample training dataset and employing them to predict a 1025-sample signal window consisting of the same training dataset and 769 future samples, which gives an insight into the prediction accuracy over 70 ms of the future signal. Six experiments were done, using vowels from the Lithuanian words "kultūra" and "įmonė". To exploit the peculiarity, the MSE of the Wiener system model on the training dataset was compared with the MSE of the IIR model. If the latter was smaller than or equal to that of the Wiener system model, the dataset windows were shifted by 10 samples. The result of each experiment is an error vector of 1024 samples, which was divided into four equal parts, and the MSE value was calculated for each quarter. The resulting data can be seen in Table 2.
The first quarter is actually the data the models were trained on. Because of this, and because of its additional 14 parameters, the results of the IIR model have to be better than those of the LPC model. Meanwhile, the smaller MSE of the Wiener system model was guaranteed by the constraint of the experiment (see data in bold).
The MSE values of the second, third and fourth quarters show how well the models perform on future data that were not seen during the training procedure. The ratios of MSE between the LPC and IIR models are very close to 1 for all vowels except "į". That could mean that the additional parameters of the linear subsystem are not very useful for the prediction of the future signal.
The Wiener system model also performs better than the IIR model in predicting future samples of all tested vowels, again with the exception of "į". An additional shift of the signal window by 10 samples was tried. It gave smaller MSE ratios for the Wiener system model than for the IIR model and confirmed that the results depend on the initial signal phase and not on the vowel. It seems that the Wiener system model benefits from its sigmoid nonlinearity, whereas the additional parameters of the IIR model gave only a marginal improvement. The average MSE ratios of the Wiener system model are smaller by about 5 %, 8 % and 8 % for the second, third and fourth quarters, respectively.
Fig. 5 depicts the change of the MSE as the error calculation window is shifted in time. In Fig. 5a it can be clearly seen that the MSE of the Wiener system model is higher than that of both other models in a time region of about 7 ms. At that time, the models had to predict a signal consisting of about 13 ms of training signal and 7 ms of future signal. The advantage of the Wiener system model can be clearly seen from a shift of about 10 ms until the end of the data. Fig. 5b shows the MSE of modelling the vowel "į". The Wiener system model outperforms the LPC model. However, the IIR model gives a better MSE throughout the modelling of the signal. Thus, the sigmoid nonlinearity does not give an improvement in this case, and a different type of nonlinearity might be needed to achieve that [5, 6, 7].
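The bookkeeping of this experiment (conversion of sample counts to time, the per-quarter MSE of an error vector, and ratios between the MSE values of two models) can be illustrated with a short sketch. The function names are hypothetical, and the direction of the ratio is left to the caller.

```python
SAMPLE_RATE_HZ = 11025  # sampling frequency of the recordings


def samples_to_ms(n_samples, rate=SAMPLE_RATE_HZ):
    """Duration of `n_samples` samples in milliseconds."""
    return 1000.0 * n_samples / rate


def quarter_mse(errors):
    """Split an error vector into four equal parts and return the MSE of each."""
    if not errors or len(errors) % 4 != 0:
        raise ValueError("error vector length must be a positive multiple of 4")
    q = len(errors) // 4
    return [sum(e * e for e in errors[i * q:(i + 1) * q]) / q for i in range(4)]


def mse_ratios(mse_a, mse_b):
    """Element-wise ratio of two lists of MSE values (model A over model B)."""
    return [a / b for a, b in zip(mse_a, mse_b)]
```

For the numbers used here, `samples_to_ms(769)` evaluates to about 69.8 ms, which matches the 70 ms horizon quoted above. A 1024-sample error vector yields quarters of 256 samples each, so the first quarter coincides in length with the training window.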

Conclusions
1. For vowel prediction, the LPC model could be replaced by the Wiener system model. Experiments with six Lithuanian vowels gave on average more than 13 % improvement in the MSE of prediction. The average MSE for prediction of the vowel "ū" with the Wiener system model was smaller by approximately 18 % for the training dataset, 17 % for the validation dataset and 14 % for the testing dataset.
2. The nonlinearity employed in the Wiener system model plays a major role in the performance of the system in comparison with the LPC and IIR models. A badly chosen nonlinearity can limit short-term prediction capabilities, as the results of two experiments with the vowel "ū" confirm, or impair long-term prediction, as the experimental results with the vowel "į" show.
3. The use of the Wiener system model instead of the IIR model improves the MSE of long-term prediction of five Lithuanian vowels on average by more than 5 %.
A further investigation of the Wiener system model should address the choice of nonlinearity, taking into account the initial phase of the speech signal to be predicted.

Fig. 1. Wiener class system
Fig. 2. Dependence of the nonlinear function (2) on the values of the gain and bias parameters
Fig. 4. Two cases of the model identification process that result in: a) superiority of the Wiener system model on the validation dataset (Experiment 2); b) superiority of the LPC model on the training dataset (Experiment 9)

Table 1. Vowel "ū" prediction results comparing the MSE of the LPC and Wiener system models on training, validation and testing data (* two best prediction results; ** population standard deviation; data in bold are extreme values)

Table 2. Six vowel prediction results comparing the MSE of the IIR, LPC and Wiener system models by quarters of the error vectors