Lithuanian Speech Synthesis by Computer using Additive Synthesis

In this paper, we present a new method for synthesizing Lithuanian speech phonemes based on the principle of additive synthesis. We assume that a phoneme model is a sum of harmonics, each of which can be generated by a formant synthesizer with properly chosen parameters. To estimate the synthesizer parameters, we use real sound signals that are expanded into harmonics by means of the inverse fast Fourier transform. The harmonic synthesizer parameters (amplitudes, damping factor, and phases) are estimated by the Levenberg-Marquardt method. We present an example of the synthesized female vowel /a/ and compare it with the true sound signal. DOI: http://dx.doi.org/10.5755/j01.eee.18.8.2631


I. INTRODUCTION
Much attention in Lithuania is given to processing Lithuanian speech and applying the results to speech recognition, animation, analysis and synthesis. Most effort is devoted to speech recognition. Recent work concentrates on developing new methods for measuring the quality of speech recognition features, on voice control of computers and electric devices, etc.
Speech animation problems also attract the attention of Lithuanian researchers. One such problem is Lithuanian phoneme visualization; a methodology for it is proposed in [1]. Some researchers try to use Wiener class systems for speech signal prediction.
An important class of speech processing problems is speech synthesis. Speech synthesis methods can be divided into two main groups: parametric synthesis methods and concatenation synthesis methods. In parametric speech synthesis, a speech signal is represented by a finite number of parameters. In concatenation synthesis, a sound is created with the help of a predefined vocabulary of initial synthesis elements. The best-known work on text-to-speech synthesis of the Lithuanian language is [2], which uses concatenation synthesis. A practical implementation of the methodology proposed in [2] can be seen in [3]. The main problem of concatenation synthesis is the amount of memory needed to store the vocabulary. In addition, the synthesized speech does not reach the quality of natural speech, since glitches occur at the concatenation boundaries.
A subgroup of parametric synthesis methods is the so-called formant synthesis methods. These methods are based on the decomposition of a speech signal into formants (signals described by quasipolynomial models) [4]-[7]. A vowel formant synthesizer was developed in [4]. Its authors proposed to synthesize Lithuanian vowels in two frequency ranges: the low-frequency range (1-900 Hz) and the high-frequency range (900-2400 Hz). Examples of two synthesized vowels, /a/ and /i/, were presented. This algorithm gave satisfactory results (natural-sounding vowel models), although certain parameters were not selected automatically. The authors of [5] did not divide the frequency range into two parts and excited each formant separately. To improve the parameter estimates, an optimization procedure was introduced in [5] whose purpose is to reduce errors due to data convolution. The amplitudes of the actual formants were used as inputs, and the distance between the inputs was equal to the main pitch period of the original sound. Models of the excitation sequences and of the fundamental frequency dynamics, however, were not developed.
A formant diphthong model was developed in [6]. In this model, the transition between the diphthong vowels is described by the arctangent function. An algorithm for estimating the parameters from convoluted data was derived and applied to synthesizing the diphthong /ai/. The same methodology as in [6] was applied to joining vowels, diphthongs and semivowels [7].
The sounds synthesized by the methods of [4]-[7] sound sufficiently natural. The synthesis procedures described in [4]-[7], however, have a disadvantage: it is difficult to determine the formant ranges and to decompose the signal into formants, because of the uncertainty related to hidden and merged formants.
In this paper, as an alternative to expanding a signal into formants, we expand it into harmonics. Since the distance between the harmonic frequencies is known, the expansion of a signal into harmonics can be done without uncertainty.

II. ADDITIVE SYNTHESIS OF A SPEECH SIGNAL
A short-time speech signal can be regarded as periodic. The periodic character can be seen in Fig. 1, where a female vowel /a/ and a male vowel /a/ are shown. Mathematically, a periodic signal $s(t)$ with period $T$ satisfies the relationship

$$s(t) = s(t + T). \qquad (1)$$

A periodic function can be expanded into a Fourier series

$$s(t) = \sum_{k=1}^{\infty} A_k \sin(2\pi k f_0 t + \varphi_k), \qquad (2)$$

where $f_0 = 1/T$ is the signal fundamental frequency, $A_k$ is the amplitude of the $k$-th harmonic, and $\varphi_k$ is the phase of the $k$-th harmonic.
A finite number of harmonics $K$ is used to synthesize speech sounds using (2), because very high harmonics hardly affect the sound of the speech signal. The relationship (2) then becomes

$$s(t) = \sum_{k=1}^{K} A_k \sin(2\pi k f_0 t + \varphi_k). \qquad (3)$$

One can encounter various transitions in speech signals (from one phoneme to another, from the phoneme beginning to the phoneme end, amplitude jumps in a stressed syllable, etc.). The phases are also important in order to avoid phase distortions in transitions. The sound synthesized using (3) sounds strongly synthetic. Therefore, in order to obtain a natural sound, it is assumed that the harmonic amplitudes and the fundamental frequency are functions of time. The relationship (3) then turns into

$$s(t) = \sum_{k=1}^{K} A_k(t) \sin(2\pi k f_0(t)\, t + \varphi_k). \qquad (4)$$

The speech sound synthesis by (4) is carried out in two steps: 1) $K$ harmonics are synthesized, and 2) the synthesized harmonics $s_k(t)$, $k = 1, \dots, K$, are summed.
Fig. 2. The speech synthesis scheme using (4).
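The two-step scheme (synthesize K harmonics, then sum them) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `additive_synthesis`, the array layout, and the choice to accumulate the phase of the fundamental sample by sample (so that the phase stays continuous when the fundamental frequency varies) are assumptions made for this illustration.

```python
import numpy as np

def additive_synthesis(amps, f0, phases, fs):
    """Synthesize K harmonics and sum them (hypothetical helper).

    amps   : array (K, N), amplitude A_k of each harmonic at each sample
    f0     : array (N,), fundamental frequency in Hz at each sample
    phases : array (K,), initial phase of each harmonic in radians
    fs     : sampling frequency in Hz
    """
    amps = np.asarray(amps, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    phases = np.asarray(phases, dtype=float)
    k = np.arange(1, amps.shape[0] + 1)[:, None]  # harmonic numbers 1..K
    # Assumption of this sketch: accumulate the fundamental's phase so that
    # a time-varying f0 does not introduce phase jumps.
    base_phase = 2.0 * np.pi * np.cumsum(f0) / fs
    base_phase -= base_phase[0]                   # phase is zero at n = 0
    # Step 1: synthesize every harmonic separately.
    harmonics = amps * np.sin(k * base_phase[None, :] + phases[:, None])
    # Step 2: sum the harmonics.
    return harmonics, harmonics.sum(axis=0)
```

For a constant fundamental frequency and constant amplitudes the result reduces to a finite sum of sinusoids at multiples of the fundamental; passing slowly varying `amps` and `f0` arrays gives the time-varying case.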

III. GENERATING OF SPEECH SIGNAL HARMONICS USING THE FORMANT SYNTHESIZER
It is not difficult to show that a formant synthesizer [6] can generate sinusoid-type signals if its parameters are properly chosen. We therefore suggest using a formant synthesizer for harmonic synthesis, i.e., using a linear system with unit pulse inputs whose impulse response is a third-order quasipolynomial

$$h_i(t) = e^{-\lambda_i t} \sum_{j=1}^{4} a_{ij}\, t^{\,j-1} \sin(2\pi f_i t + \varphi_{ij}), \qquad (5)$$

where $i$ is the harmonic number, $t$ is continuous time, $f_d$ is the sampling frequency, $f_i$ is the harmonic frequency, $\lambda_i$ is the damping factor, $a_{i1}, a_{i2}, a_{i3}, a_{i4}$ are the amplitudes, and $\varphi_{i1}, \varphi_{i2}, \varphi_{i3}, \varphi_{i4}$ are the phases. The computations are carried out not with continuous functions but with the sequences obtained by sampling the continuous-time signals at $t = n/f_d$. We therefore use the discrete-time state-space model (6) of [8], in which $F$ is a block-diagonal matrix made of $K$ Jordan blocks, $K$ is the total number of harmonics, and $G$ is a block matrix whose entries are indexed by $i = 1, 2, \dots, K$ and $j = 1, \dots, 4$.
In order to use (6) for harmonic synthesis, one has to estimate all the parameters of the system.
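As an illustration of how a single harmonic could be generated, the sketch below samples an impulse response of the assumed form $e^{-\lambda t}\sum_{j=1}^{4} a_j t^{j-1}\sin(2\pi f t + \varphi_j)$ and drives it with a train of unit pulses by direct convolution. It does not implement the state-space model (6); the function names and the convolution-based formulation are assumptions made for this example.

```python
import numpy as np

def harmonic_impulse_response(a, phi, f, lam, fs, n_samples):
    """Sampled third-order quasipolynomial response of one harmonic.

    a, phi : four amplitudes and four phases (radians)
    f      : harmonic frequency in Hz
    lam    : damping factor in 1/s
    fs     : sampling frequency in Hz
    """
    a = np.asarray(a, dtype=float)[:, None]
    phi = np.asarray(phi, dtype=float)[:, None]
    t = np.arange(n_samples) / fs
    powers = np.arange(a.shape[0])[:, None]          # exponents 0, 1, 2, 3
    terms = a * t[None, :] ** powers * np.sin(2.0 * np.pi * f * t[None, :] + phi)
    return np.exp(-lam * t) * terms.sum(axis=0)

def drive_with_unit_pulses(h, period_samples, n_pulses):
    """Output of the linear system for a periodic train of unit pulses."""
    u = np.zeros(period_samples * n_pulses)
    u[::period_samples] = 1.0                        # one unit pulse per period
    return np.convolve(u, h)[: u.size]
```

Because the system is linear, the output is the superposition of shifted copies of the impulse response, one per excitation pulse; the pulse spacing `period_samples` plays the role of the pitch period.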

IV. ESTIMATION OF PARAMETERS OF THE FORMANT HARMONIC SYNTHESIZER FROM REAL DATA
To estimate the model parameters, we first expand the speech signal into harmonics using rectangular filters implemented by means of the inverse Fourier transform. The magnitude response of the female vowel /a/ is shown in Fig. 3, and the signal harmonics in Fig. 4. To estimate the formant filter parameters, we select a "representative" pitch (Fig. 5). For this purpose, we select the minimum point 1, then go up to the first maximum 2, and go down to the first minimum 3.
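The expansion into harmonics with rectangular filters can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the paper: the band edges (half a fundamental on either side of each harmonic frequency) and the function name `split_into_harmonics` are choices made for this example, and the fundamental frequency is assumed to be known.

```python
import numpy as np

def split_into_harmonics(signal, fs, f0, n_harmonics):
    """Isolate each harmonic with a rectangular band in the frequency domain."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    harmonics = []
    for k in range(1, n_harmonics + 1):
        # Rectangular filter (assumed width): keep only the bins that lie
        # within half a fundamental of the k-th harmonic frequency.
        band = (freqs >= (k - 0.5) * f0) & (freqs < (k + 0.5) * f0)
        filtered = np.where(band, spectrum, 0.0)
        # The inverse transform returns the k-th harmonic in the time domain.
        harmonics.append(np.fft.irfft(filtered, n=signal.size))
    return np.array(harmonics)
```

Since the bands do not overlap, summing the returned rows reproduces the part of the signal that lies below the upper edge of the last band.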
A "step-by-step" Levenberg-Marquardt type algorithm described in [8] is used for parameter estimation from convoluted data. The parameters of the harmonic synthesis system are estimated in parallel, separately for the data of each harmonic. The seventh harmonic of the speech signal and the seventh harmonic of the synthesizer output signal are shown in Fig. 6. To check the synthesizer accuracy, we carried out the following experiment: three unit pulses with the period $T = 1/f_0$ (the reciprocal of the fundamental frequency $f_0$) were sent to the synthesizer input, and the output signal was compared with the "representative" pitch. The true and model signals are shown in Fig. 7.
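To make the estimation step concrete, the following sketch fits the amplitude, damping factor and phase of a single damped sinusoid with a textbook Levenberg-Marquardt iteration. It is deliberately simplified and is not the step-by-step algorithm of [8]: it keeps only the first-order term of the harmonic model, assumes the harmonic frequency is known, ignores the convolution of overlapping pitch periods, and uses hypothetical function names.

```python
import numpy as np

def model(p, t, f):
    """Damped sinusoid a * exp(-lam * t) * sin(2*pi*f*t + phi)."""
    a, lam, phi = p
    return a * np.exp(-lam * t) * np.sin(2.0 * np.pi * f * t + phi)

def jacobian(p, t, f):
    """Partial derivatives of the model with respect to (a, lam, phi)."""
    a, lam, phi = p
    env = np.exp(-lam * t)
    arg = 2.0 * np.pi * f * t + phi
    return np.column_stack([env * np.sin(arg),
                            -a * t * env * np.sin(arg),
                            a * env * np.cos(arg)])

def levenberg_marquardt(y, t, f, p0, n_iter=100, mu=1e-2):
    """Basic Levenberg-Marquardt loop with Marquardt's diagonal scaling."""
    p = np.asarray(p0, dtype=float)
    r = y - model(p, t, f)
    cost = r @ r
    for _ in range(n_iter):
        J = jacobian(p, t, f)
        A = J.T @ J
        step = np.linalg.solve(A + mu * np.diag(np.diag(A)), J.T @ r)
        p_new = p + step
        r_new = y - model(p_new, t, f)
        cost_new = r_new @ r_new
        if cost_new < cost:
            # Accept the step and move towards a Gauss-Newton update.
            p, r, cost = p_new, r_new, cost_new
            mu = max(mu / 10.0, 1e-12)
        else:
            # Reject the step and increase the damping.
            mu *= 10.0
    return p
```

Running one such fit per harmonic mirrors the idea of estimating the parameters of each harmonic independently; the initial guess `p0` needs to be reasonably close to the true values, in particular for the phase.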

V. CONCLUSIONS
A new harmonic synthesis method belonging to the additive synthesis class is proposed as an alternative to the formant synthesizer. The experimental results show that the sounds synthesized by this method are sufficiently natural and pleasant-sounding. Third-order polynomial models are used to model the dynamics of the amplitudes and periods of the excitation pulse sequence.
Although the synthesizer model seems to be of a high order and has an excessive number of parameters, it performs an important function providing naturalness to the synthesized speech.When necessary, the model can be

Fig. 6. The seventh speech signal harmonic and the seventh harmonic of the synthesizer output signal ('+': the data; solid line: the synthesizer output).