Comparison of Formant Features of Male and Female Emotional Speech in Czech and Slovak

The paper describes the analysis and comparison of formant features comprising the first three formant positions together with their 3-dB bandwidths and the formant tilts. These features were determined from the smoothed spectral envelopes or calculated directly from the complex roots of the LPC polynomial. Subsequently, statistical analysis and comparison of the formant features from emotional speech representing joy, sadness, anger, and a neutral state was performed. The speech material used in this experiment consists of sentences uttered by male and female professional speakers in the Czech and Slovak languages. For detailed analysis, a derived speech database was created, consisting of manually selected sounds corresponding to the stationary parts of five vowels and two nasals. The determined formant positions and their value ranges correspond with the general knowledge for male and female voices. The obtained statistical results and values of parameter ratios will be used for emotional speech conversion, and they can also be applied to extend the text-to-speech system to enable expressive speech production. DOI: http://dx.doi.org/10.5755/j01.eee.19.8.1739


I. INTRODUCTION
Identification of emotions in speech depends on the chosen set of features extracted from the speech signal. These features are systematically divided into segmental and supra-segmental ones [1]. Short-term segmental features derived from speech frames with short duration are usually related to the speech spectrum. These include traditional features like linear predictive coefficients, line spectral frequencies, mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients [2], or unconventional ones like perceptual linear predictive coefficients, log frequency power coefficients, etc. [3]. Supra-segmental features comprise statistical values of parameters describing prosody by duration, fundamental frequency, and energy. This category also comprises a separate group of features constituting voice quality parameters: jitter, shimmer, glottal-to-noise excitation ratio, normalized amplitude quotient, spectral tilt, and spectral balance [2].
Emotion detection from speech is usually carried out with a combination of various features, among which the MFCC are almost invariably used [4]. Improvement in speech emotion recognition can be attained by using various sets of new features, e.g. harmony features [5] or semantic labels [6]. However, within the standard features, the formants convey information about the speaker's vocal tract, which differs not only between speakers but also changes its shape for the same speaker in different emotional states.
This paper describes the analysis and comparison of the formant features (FF) of male and female Czech and Slovak acted speech in four emotional states: joy, sadness, anger, and a neutral state. The Czech and Slovak languages (both belonging to the Slavonic languages) are similar but distinct; therefore, we can use a common speech corpus to obtain spectral parameters, but on the phonetic and prosodic level the synthetic speech must be processed separately. The motivation of our work was to find the parameters for extending the text-to-speech (TTS) system to enable expressive speech production [7]. The obtained parameter ratios between male and female voices, as well as the ratios between emotional and neutral speech, will be used in an emotional speech transformation (conversion) method.

II. METHODS USED FOR SPECTRAL ENVELOPE SMOOTHING
The FF consist of the basic frequency parameters, i.e. the first three formant positions ($F_1$, $F_2$, and $F_3$) together with their bandwidths, and the complementary parameters (usually the formant tilts) that can be calculated by several techniques [8]. In practice, two basic approaches to FF determination are mostly used: the first one calculates them from the complex roots of the LPC polynomial; the second one finds the local maxima of the smoothed spectral envelope where its gradient changes from positive to negative.
The formant positions and their bandwidths are predominantly determined from the smoothed envelope of the voiced parts of the speech signal. In our case, the periodogram uses an $N_{FFT}$-point FFT to compute the power spectral density (PSD) of the input speech signal as $S(e^{j\omega})/f_s$, where $f_s$ is the sampling frequency.
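The PSD computation described above can be illustrated with a minimal sketch. It assumes a Hamming window, a two-sided spectrum, and a direct DFT written in pure Python for self-containment; the function names, frame length, and hop size are hypothetical and are not taken from the paper.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def periodogram_psd(x, window, n_fft, fs):
    """Two-sided PSD estimate S(e^{jw}) / fs sampled at n_fft frequencies."""
    n = len(x)
    xw = [xi * wi for xi, wi in zip(x, window)]
    # Mean window power normalizes away the attenuation caused by the window.
    win_power = sum(wi * wi for wi in window) / n
    psd = []
    for m in range(n_fft):
        w = 2 * math.pi * m / n_fft
        acc = sum(v * cmath.exp(-1j * w * k) for k, v in enumerate(xw))
        psd.append((abs(acc) ** 2 / n) / win_power / fs)
    return psd


def welch_psd(signal, frame_len, hop, n_fft, fs):
    """Average the windowed periodograms of overlapping frames (Welch method)."""
    win = hamming(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    psds = [periodogram_psd(signal[s:s + frame_len], win, n_fft, fs) for s in starts]
    return [sum(col) / len(psds) for col in zip(*psds)]
```

For real work an FFT-based routine such as `scipy.signal.welch` would be the usual choice; the direct DFT here only keeps the arithmetic visible.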
The smooth spectral envelope of the speech signal can also be determined during cepstral analysis [9]. Cepstral analysis of the speech signal is performed in the following way: first, the complex spectrum is calculated from the input samples using the FFT algorithm (after segmentation and weighting by a Hamming window). In the next step, the log power spectrum is computed. Application of the inverse FFT algorithm gives the symmetric real cepstrum. By limitation to the first $N_0 + 1$ coefficients, the truncated cepstrum represents an approximation of a log spectrum envelope

$$E(e^{j\omega}) = c_0 + 2\sum_{n=1}^{N_0} c_n \cos(n\omega), \tag{2}$$

where the first cepstral coefficient $c_0$ corresponds to the signal energy.
Formant frequencies and bandwidths can be determined simply using an autoregressive (AR) model, well known in speech processing as the linear predictive coding (LPC) model, which is an all-pole model of the vocal tract. The autocorrelation method uses the Levinson-Durbin recursion to compute the parameters $\{a_k\}$ describing the speech spectral envelope

$$S(e^{j\omega}) = \frac{G}{\left|1 + \sum_{k=1}^{N_A} a_k e^{-j\omega k}\right|}, \tag{3}$$

where $G$ is the gain of the all-pole model and $N_A$ is the order of the AR model.
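The Levinson-Durbin recursion mentioned above can be sketched as follows. This is a minimal illustration, assuming the sign convention $A(z) = 1 + \sum_k a_k z^{-k}$ and a biased (unnormalized) autocorrelation estimate; the function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a (windowed) frame x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for an AR model of the given order.

    Returns (a, err), where a = [1, a_1, ..., a_order] are the coefficients
    of A(z) = 1 + sum_k a_k z^-k and err is the final prediction error power.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for extending the model from order i-1 to i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err
```

Calling `levinson_durbin(autocorrelation(frame, order), order)` on a Hamming-weighted voiced frame would give the coefficients whose roots or envelope are then searched for formants.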

III. CALCULATION AND EVALUATION OF FORMANT FEATURES
We apply two methods for determination of the basic formant features: 1) Indirect: the formant positions are found as the first three local maxima of the smoothed spectral envelope where its gradient changes from positive to negative, and the corresponding bandwidths are obtained as the frequency intervals between the points where the magnitude spectrum decreases by 3 dB from the formant amplitudes; 2) Direct: the formant frequencies and their bandwidths are estimated directly from the complex roots of the LPC polynomial $A(z)$.
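Using the Newton-Raphson or the Bairstow algorithm [10], we obtain the complex root pairs corresponding to the poles of the LPC transfer function. The formant frequency $F_n$ and the 3-dB bandwidth $B_n$ in [Hz] can be determined by

$$F_n = \frac{f_s}{2\pi}\,\theta_n, \qquad B_n = -\frac{f_s}{\pi}\,\ln\left|z_n\right|, \tag{4}$$

where $\theta_n$ is the angle in [rad] and $|z_n|$ is the modulus of the complex root $z_n$.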
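As a small illustration of the direct method, the sketch below converts already computed complex roots into formant frequency and bandwidth pairs using the standard pole-angle and pole-modulus relations. Root finding itself is assumed to be done elsewhere, and the function name is hypothetical.

```python
import cmath
import math


def formants_from_roots(roots, fs):
    """Map complex LPC roots to sorted (frequency_hz, bandwidth_hz) pairs.

    Only roots with a positive imaginary part are used, so each complex
    conjugate pair contributes one formant candidate and real roots are skipped.
    """
    candidates = []
    for z in roots:
        if z.imag <= 0:
            continue
        theta = cmath.phase(z)                           # angle in rad
        freq = fs / (2 * math.pi) * theta                # formant frequency
        bandwidth = -fs / math.pi * math.log(abs(z))     # 3-dB bandwidth
        candidates.append((freq, bandwidth))
    return sorted(candidates)
```

A pole close to the unit circle gives a narrow bandwidth, while a pole near the origin gives a wide one that would later be rejected by the 500 Hz criterion.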
Indirect determination of the basic FF is realized by a combination of all three mentioned approaches to spectral envelope calculation and smoothing. In the case of the LPC envelope calculated by (3), the higher order is applied; in the case of direct calculation from the roots of the LPC polynomial by (4), the lower order is applied. Correctness of the basic FF values obtained by all three indirect methods as well as by direct calculation from the roots is assessed by two criteria: 1) The resulting values of the 3-dB bandwidths must be less than 500 Hz [11]; 2) The found values of the first three formant positions must fall within the corresponding frequency interval depending on the voice type (male/female). Only the resulting values fulfilling these conditions are used for further processing. The whole algorithm for determination of the basic and complementary formant features is described by the block diagram in Fig. 1. The complementary FF can be defined as formant tilts: angles between the spectrum peaks at the determined first three formant positions (see the illustration in Fig. 2). The general bisector formula in the parametric form using the direction $k$ is defined as

$$y - y_1 = k\,(x - x_1), \qquad k = \frac{y_2 - y_1}{x_2 - x_1}. \tag{5}$$
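The formant tilt between two spectral peaks can be sketched as a slope and an angle. The example assumes the peak coordinates are given as frequency in Hz and PSD level in dB, and it uses the angle definition $\varphi = (\arctan(k)/\pi) \cdot 180$ stated in the paper; the function name is hypothetical.

```python
import math


def formant_tilt(x1, y1, x2, y2):
    """Direction k (dB per Hz) and angle in degrees between two formant peaks.

    (x1, y1) and (x2, y2) are the frequency in Hz and the PSD level in dB
    of the two peaks, with x1 != x2.
    """
    k = (y2 - y1) / (x2 - x1)
    phi = math.atan(k) / math.pi * 180.0
    return k, phi
```

Because the slope is expressed in dB per Hz, the absolute angle depends on the axis units; only comparisons made with the same scaling are meaningful.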


IV. MATERIAL, EXPERIMENTS AND RESULTS
The speech corpus used in our experiment was extracted from Czech and Slovak stories performed by professional actors. It contains sentences with different contents expressed in four emotional states ("neutral state", "joy", "sadness", and "anger") uttered by several speakers (134 sentences spoken by male voices and 132 sentences spoken by female voices, 8+8 speakers altogether). The processed speech material consists of sentences with durations from 0.5 to 5.5 seconds, resampled at 16 kHz. The frame length for spectral analysis depends on the mean pitch period of the processed signal. For spectral analysis we chose 24-ms frames for the male voices and 20-ms frames for the female voices.
Calculation of the FF values was supplemented with determination of the fundamental frequency F0 by the autocorrelation analysis method with experimentally chosen pitch ranges: 55 to 250 Hz for the male voices and 105 to 350 Hz for the female ones. Then, the F0 values were compared with and corrected by the results obtained with the help of the PRAAT program [12] with similar internal settings of F0 values. For voicing classification of the frames, the value of the detected pitch period $L$ was used. If $L \neq 0$, the processed speech frame is classified as voiced; if $L = 0$, the frame is marked as unvoiced.
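The pitch detection and voicing decision described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the normalized autocorrelation peak and the `threshold` parameter are assumptions introduced here to decide when no pitch is found, and the function name is hypothetical.

```python
import math


def detect_pitch_period(frame, fs, f0_min, f0_max, threshold=0.3):
    """Return the pitch period L in samples, or 0 if the frame looks unvoiced.

    The lag search is limited to the range implied by [f0_min, f0_max] in Hz.
    threshold is an assumed minimum normalized autocorrelation peak.
    """
    r0 = sum(v * v for v in frame)
    if r0 == 0.0:
        return 0  # silent frame
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag)) / r0
        if r > best_val:
            best_lag, best_val = lag, r
    return best_lag if best_val >= threshold else 0


# A 24-ms frame at 16 kHz holds 384 samples (a 20-ms frame holds 320).
fs = 16000
frame = [math.sin(2 * math.pi * 125 * t / fs) for t in range(384)]
L = detect_pitch_period(frame, fs, f0_min=55, f0_max=250)
is_voiced = L != 0
f0 = fs / L if is_voiced else 0.0
```

With the male pitch range of 55 to 250 Hz, the search covers lags from 64 to 290 samples at 16 kHz.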
From the main speech signal database of spoken sentences, a second database consisting of manually selected ROIs corresponding to the stationary parts of the vowels "a", "e", "i", "o", "u", and the consonants "m" and "n" was subsequently created for detailed analysis. The total number of analyzed voiced frames was: a) Male: neutral 5103, joy 4927, sadness 4642, anger 4391; b) Female: neutral 5223, joy 4541, sadness 4203, anger 4349. Partial results of the analysis of all voiced frames of the main speech corpus are presented in the form of box-plot graphs of basic statistical parameters of the $F_{1,2,3}$ values for male and female voices determined from the neutral and emotional speech (Fig. 3).
Fig. 3. Basic statistical parameters of the first three formant frequencies for male and female voices: speech in "neutral" (a), "sad" (b), "joyful" (c), and "angry" emotional styles (d).
The common bar graphs of mean values of the first three formant 3-dB bandwidths for both voices are presented in Fig. 4. Summary histograms of the first three formant frequencies for the male and female speech in different emotional styles are shown in Fig. 5 (male voice) and Fig. 6 (female voice). Two diagrams of bisectors with directions given by formant tilts from the male and the female voices are shown in Fig. 7. A comparison of the obtained mutual positions of the mean formant frequencies $F_1/F_2$, $F_1/F_3$, and $F_2/F_3$ for different emotional states of male and female voices is presented in Fig. 8. Diagrams of the detailed analysis of formant mutual frequency positions for different emotional states of voiced sounds corresponding to the vowels "a", "e", "o", and the consonant "n" selected from the speech material of male and female voices are shown in Fig. 9 and Fig. 10. The mean emotional-to-neutral formant position ratios between different emotional states and a neutral state for male and female voices are presented in Table I; summary female-to-male ratios of formant positions are shown in Table II. Detailed results of the mean $F_{1,2,3}$ frequencies and their 3-dB bandwidths of the selected voiced sounds in the neutral speaking style are shown together with the numbers of analyzed voiced frames $N_F$ in Table III and Table IV for male and female voices. Summary results of the detailed analysis of formant tilts (complementary angles in [deg]) of voiced sounds in the neutral speaking style for male and female voices are presented in Table V.

V. DISCUSSION
Our experiment was aimed at analysis and comparison of the formant features in emotional and neutral speech (voiced parts of the recorded speech signal). For speech synthesis, the source-filter model with cepstral description of the vocal tract transfer function [9] was applied. The parameters used in the original realization of the cepstral speech synthesizer had been obtained by evaluation of a speech signal in the database of phones uttered by a male speaker in neutral speech. Subsequently, we decided to carry out an analysis of the first three formant positions ($F_1$, $F_2$, and $F_3$) obtained from speech signals expressing different emotional states and to compare the results for male and female voices. The ratios of the first three formant positions (see Table I) together with the summary of female-to-male ratios (see Table II) will be used for emotional speech transformation and production in the multi-voice TTS system.
Values of the basic formant features obtained from the female voices have a higher standard deviation (compare the box-plot graphs of basic statistical parameters in Fig. 3) and, as expected, the formant frequencies are about 15 % higher than those of the male voices. In addition, the obtained results agree with the conclusions in [13] that during pleasant emotions the first formant falls and the higher resonances are raised. For unpleasant emotions the first formant rises, and the second and the third formants fall. We can also conclude that the first formant and the higher formants of emotional speech shift in opposite directions. For pleasant emotions the first formant shifts to the left (towards lower frequencies), and the higher formants shift to the right. For unpleasant emotions the opposite situation occurs: the first formant shifts to the right, and the higher formants shift to the left. In contrast, the values of the formant 3-dB bandwidths show no correlation with the type of the speaking style or the type of the voice (see the common bar graphs in Fig. 4). On the other hand, the comparison of the formant tilts shows good differentiation between neutral and emotional styles (Fig. 7) for both voices.
Results of the detailed analysis of the five basic vowels indicate differences between the $F_{1,2,3}$ positions for neutral and emotional styles, which are clearly visible in the graphs of formant mutual frequency positions (Fig. 8 and Fig. 9). However, in the case of the consonants "m" and "n" the differences of the $F_{1,2,3}$ values were lower due to the smaller absolute amplitudes of the speech signal compared with the vowels, and they cannot be correctly compared visually. These results are in good correspondence with the general knowledge [14], [15] that the vowel formant areas of the male voice lie in the ranges $F_1 \approx$ 250 to 700 Hz, $F_2 \approx$ 700 to 2000 Hz, $F_3 \approx$ 2000 to 3200 Hz, and the vowel formant areas of the female voice are higher, lying at about $F_1 \approx$ 300 to 840 Hz, $F_2 \approx$ 840 to 2400 Hz, $F_3 \approx$ 2400 to 3840 Hz.
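The two correctness criteria for the basic FF (bandwidths below 500 Hz and formant positions within voice-dependent intervals) can be sketched as a simple filter. The paper does not list the intervals actually used for the second criterion, so this sketch assumes the vowel formant ranges quoted above; the names are hypothetical.

```python
# Assumed acceptance intervals in Hz, taken from the vowel ranges quoted above.
FORMANT_RANGES_HZ = {
    "male": [(250, 700), (700, 2000), (2000, 3200)],
    "female": [(300, 840), (840, 2400), (2400, 3840)],
}
MAX_BANDWIDTH_HZ = 500  # criterion 1: 3-dB bandwidths must stay below this


def is_valid_formant_set(formants, bandwidths, voice):
    """Check F1..F3 and their 3-dB bandwidths against both criteria."""
    ranges = FORMANT_RANGES_HZ[voice]
    if len(formants) != 3 or len(bandwidths) != 3:
        return False
    in_range = all(lo <= f <= hi for f, (lo, hi) in zip(formants, ranges))
    narrow = all(b < MAX_BANDWIDTH_HZ for b in bandwidths)
    return in_range and narrow
```

Note that each quoted female limit equals 1.2 times the corresponding male limit (for example, 1.2 × 700 Hz = 840 Hz).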
Numerical matching of the mean $F_{1,2,3}$ values of all voiced sounds in a neutral style also documents sufficient differentiation, whereas the mean $B3_{1,2,3}$ values again do not carry this information (see Table III and Table IV). The complementary angles between PSD at frequencies $F_1$ and $F_2$ ($\varphi'_{12}$) and the complementary angles between PSD at frequencies $F_1$ and $F_3$ ($\varphi'_{13}$) always have negative values. The complementary angles between PSD at frequencies $F_2$ and $F_3$ ($\varphi'_{23}$) can also have positive values or values near zero (see Table V).

VI. CONCLUSIONS
Knowledge about the effect of emotional states on speech signals is very important not only for emotion recognition but also for standard speech recognition when the influence of various factors (including the speaker's emotional state) is taken into consideration [16]. Considering that our current database contains only speech with acted emotional styles, speech material representing real emotions should also be recorded and its FF properties analyzed. Last but not least, we would like to make a broader comparison with other databases in different languages (e.g. the German speech database Emo-DB [17], or the international COST 2102 Italian Database of Emotional Speech [18]).
Manuscript received May 23, 2012; accepted April 3, 2013. The work has been supported by the Technology Agency of the Czech Republic (TA01030476), the Ministry of Education of the Slovak Republic (VEGA 1/0987/12), and the Grant Agency of the Slovak Academy of Sciences (VEGA 2/0090/11).
In the bisector formula for the direction $k$, $y_{1,2}$ represent the values of PSD in [dB] of the determined formants, and $x_{1,2}$ are the positions of the formants on the frequency axis in [Hz]. When $k < 0$, the formants have a declining trend; when $k > 0$, the formants have an ascending trend. The resulting angle $\varphi$ in degrees is defined as $\varphi = (\arctan(k)/\pi) \cdot 180$. The obtained basic and complementary FF values are processed separately depending on the voice type (male/female), subsequently sorted by emotional styles, and stored in separate stacks. The whole evaluation and comparison process of the FF values consists of six steps: 1) Calculation of basic statistical values of formant frequencies and their 3-dB bandwidths (minimum, maximum, mean values, and standard deviation); 2) Calculation and building of the histograms for the $F_{1,2,3}$ frequencies; 3) Building of bar diagrams of the $F_{1,2,3}$ and $B3_{1,2,3}$ values for visual comparison; 4) Building of diagrams of formant tilts (bisectors with directions) and of the $F_1/F_2$, $F_1/F_3$, and $F_2/F_3$ mutual frequency positions; 5) Calculation of ratios of the basic FF mean values for emotional and neutral states; 6) Numerical matching of formant tilts (directions and angles between the first three spectral maxima of a smoothed envelope).
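Steps 1 and 5 of the evaluation process can be sketched with the Python standard library. The sample standard deviation is an assumption (the paper does not state which estimator was used), the function names are hypothetical, and the numbers in the usage lines are purely illustrative, not measured values.

```python
import statistics


def basic_stats(values):
    """Minimum, maximum, mean, and sample standard deviation (step 1)."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
    }


def emotional_to_neutral_ratio(emotional, neutral):
    """Ratio of the mean value in an emotional style to the neutral mean (step 5)."""
    return statistics.mean(emotional) / statistics.mean(neutral)


# Illustrative F1 values in Hz only; not data from the paper.
neutral_f1 = [380.0, 400.0, 420.0]
angry_f1 = [480.0, 500.0, 520.0]
stats = basic_stats(angry_f1)
ratio = emotional_to_neutral_ratio(angry_f1, neutral_f1)
```

The same ratio function applied to female and male means would give the female-to-male ratios of formant positions.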

Fig. 4. Common bar graphs of mean values of the first three formant frequencies (upper), and their 3-dB bandwidths (lower) for different emotional states of male and female voices.

Fig. 7. Summary diagrams of bisectors with directions given by formant tilts of male (a) and female (b) voices.
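To obtain the smoothed spectral envelope, the mean periodogram of the chosen region of interest (ROI) areas, i.e. the voiced parts of the speech signal, can be computed by the Welch method. The periodogram for an input signal of a sample sequence $[x_1, \ldots, x_n]$ weighted by a window $[w_1, \ldots, w_n]$ is defined as

$$S(e^{j\omega}) = \frac{\dfrac{1}{n}\left|\sum_{k=1}^{n} w_k x_k e^{-j\omega k}\right|^2}{\dfrac{1}{n}\sum_{k=1}^{n}\left|w_k\right|^2}. \tag{1}$$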

TABLE I. MEAN EMOTIONAL-TO-NEUTRAL FORMANT POSITION RATIOS.
TABLE II. FEMALE-TO-MALE RATIOS OF FORMANT POSITIONS.
TABLE III. DETAILED RESULTS OF THE MEAN F1,2,3 FREQUENCIES AND THEIR BANDWIDTHS: NEUTRAL SPEAKING STYLE, MALE VOICE.
TABLE IV. DETAILED RESULTS OF THE MEAN F1,2,3 FREQUENCIES AND THEIR BANDWIDTHS: NEUTRAL SPEAKING STYLE, FEMALE VOICE.
TABLE V. DETAILED ANALYSIS OF FORMANT TILTS; COMPLEMENTARY ANGLES IN [DEG] FOR MALE AND FEMALE VOICES.