Psychological Stress Detection in Speech Using Return-to-opening Phase Ratios in Glottis

This paper investigates psychological stress in the speech signal using the shapes of normalised glottal pulses. The pulses were estimated by two algorithms: Direct Inverse Filtering and Iterative and Adaptive Inverse Filtering. Normalised glottal pulses are divided into an opening and a return phase, and a feature vector characterizing each glottal pulse is calculated for a series of n-percentage intervals in the time domain. Each feature vector consists of parameters describing the return-to-opening phase ratio, namely of the chosen intervals, kurtosis, skewness, and area. Psychological stress is then detected from the feature vector by four different classifiers. Experimental results show that the best accuracy, approaching 95 %, is reached with the Gaussian Mixture Models classifier. All the best results were obtained using only the interval of 5 % of both phase durations, i.e. before and after the pulse peak, where the most significant differences between normal and stressed speech occur in the feature vector. The presented experiments were performed on our own speech database containing both real stressed speech and normal speech. DOI: http://dx.doi.org/10.5755/j01.eee.21.5.13336

I. INTRODUCTION
The first application of glottal pulses can be found in speech synthesis, where a precise understanding of glottal pulses and their estimation leads to high-quality synthetic speech. For instance, a novel method called Glottal Spectral Separation (GSS) was recently published by Cabral et al. [1]. The GSS method produces high-quality speech using a suitable combination of a mixed excitation model and noise components. Another method of speech synthesis was introduced by Raitio et al. [2], where the synthetic voice is produced using Hidden Markov Models (HMM) and Iterative and Adaptive Inverse Filtering (IAIF), leading to subjectively highly natural synthetic speech. A similar HMM-based speech synthesizer based on the Liljencrants-Fant (LF) model of the glottal flow was published by Cabral et al. [3]. Glottal pulses can also be used in music, for instance in speech (singing) resynthesis [4].
The next application field of glottal pulses is so-called expressive speech processing, which deals with the expression of emotions and with dynamic and varying voice quality and articulation during phonation. In 1980, dynamic changes depending on the phonation type, specifically on the glottal source signal, were published by Laver [5]. Differences between prosodic and glottal features were statistically processed and published in [6], where the glottal features, contrary to the prosodic features, show significant differences for all 30 emotion pairs. The use of a suitable combination of prosodic and glottal features for emotion recognition is also described in [7], where Support Vector Machine (SVM), Artificial Neural Network and Gaussian Mixture Models (GMM) classifiers were applied to the Berlin emotional speech database. The symmetry of the glottal pulse shape has been used for distinguishing between six spoken emotional states [8], reaching an average efficiency of 66.5 % for a well-recorded speech signal and 47 % for noisy speech. A number of observed parameters, including glottal features, are described in [9] depending on the type of psychological stress influence. A set of chosen speech features was tested with different classifiers for gender and emotion recognition [10].
Glottal flow analysis can also be applied in speaker recognition [11]. The efficiency of the glottal source component derived from the Linear Prediction (LP) residual was preliminarily tested for speaker recognition using Auto Associative Neural Network models on a total of 20 speakers [12]. Other studies using, for instance, Glottal Flow Cepstrum Coefficients [13] and a vocal source model [14] were experimentally tested for speaker recognition.
Glottal pulse analysis can also be applied in the biomedical field. Recently, the detection of Parkinson's disease from dysphonia measurements was described as a promising intermediate step towards a non-invasive diagnostic method [15]. Glottal pulses can also be utilized for the analysis of vocal disorders [16] and alcohol intoxication [17] as well as for Alzheimer's disease detection [18]. Other possible disease detection using voice analysis can be found in the review published by Saloni et al. [19]. In general, a survey of glottal source processing and its applications was written by Drugman et al. [20].

II. MINING THE GLOTTAL PULSES
Although obtaining the real glottal waveform from the speech signal has been researched for years, the best results so far are reached only on the basis of glottal flow estimation. Glottal flow can be characterized by a set of glottal pulses repeated with the fundamental period T. An example of glottal flow is illustrated in Fig. 1. Briefly, the whole glottal pulse is composed of two phases: the primary opening phase To and the return phase Tr. The space between particular pulses is called the closed phase Tc, during which the glottis is closed and the air does not flow through the gap.
A detailed description of each individual part of the glottal flow, including physical changes and processes, can be found in [21]. The most widely used methods for the estimation of glottal flow are DIF (Direct Inverse Filtering) and IAIF. Both methods are based on all-pole modelling of the speech signal using LP analysis, which models the transfer function of the vocal tract with an impulsive or periodic source substituting the glottis. The topic of inverse filtering and its impact on voice research and therapy was addressed by Lofqvist [22] and Nwachuku [23]. Other methods of glottal pulse estimation are described in [24].
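The LP-based inverse filtering idea behind DIF can be illustrated by a minimal sketch. It is not the Aparat implementation: the function names, the LP order of 18, the Hamming window and the leaky-integrator coefficient are assumptions chosen only for illustration. The sketch estimates an all-pole vocal tract model with the autocorrelation method, applies the inverse filter A(z) to the speech frame, and integrates the result to compensate for lip radiation.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LP analysis; returns a = [1, a1, ..., ap]."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    # Autocorrelation for lags 0..order.
    r = np.array([np.dot(windowed[:len(windowed) - k], windowed[k:])
                  for k in range(order + 1)])
    # Toeplitz normal equations R a = -r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # small ridge for numerical stability
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))

def direct_inverse_filter(frame, order=18, leak=0.99):
    """Simplified DIF sketch: LP analysis, inverse filtering, integration.

    `order` and `leak` are illustrative assumptions, not values from the paper.
    """
    frame = np.asarray(frame, dtype=float)
    a = lpc_coefficients(frame, order)
    # Inverse filter A(z) applied as an FIR filter removes the vocal tract.
    residual = np.convolve(frame, a)[:len(frame)]
    # Leaky integration cancels the lip radiation (differentiation) effect.
    flow = np.zeros_like(residual)
    acc = 0.0
    for i, value in enumerate(residual):
        acc = leak * acc + value
        flow[i] = acc
    return flow, a
```

IAIF would repeat this kind of estimation iteratively, first removing a rough glottal contribution before re-estimating the vocal tract; that refinement is omitted here.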
Basically, the DIF method can be classified as a traditional inverse filtering method based on autoregressive modelling [25]. The IAIF estimation method can be simply described as a suitably connected serial-parallel combination of a pair of DIF blocks. This method is described in detail in [26]; it gives better results of glottal flow estimation but is more computationally demanding. All analysed glottal pulses are mined from speech by a modified version of the Aparat software [21], where the pulse estimation is based on the four-parameter LF glottal model [27]. Thus, the beginning, the position of the maximum and the end of the analysed glottal pulse (its absolute height and width) are defined by the mined LF model and used in further processing.
Under psychological stress, people tend to shorten syllables, or even entire words; therefore, the stability of the glottal pulse shape has also been observed in dependence on the word duration. The first row of Fig. 2 illustrates the waveform (light grey) of the Czech syllable "ču", containing mainly the vowel /u/, together with the estimated glottal pulses. The glottal flow of the briefly spoken word is black (left column), and the longer version of the same word is dark grey (right column). In the next subfigure (second row), both waveforms of estimated glottal flows are plotted on the same time scale to show the shape differences and the fundamental periods Tshort and Tlong of the short and long versions of the same word, where the fundamental frequency of the briefly spoken word is slightly higher (82 Hz versus 78 Hz). Evidently, the glottal pulses vary with neither the duration of the spoken word nor the fundamental period. This finding is verified by the two-dimensional normalisation of the mined glottal pulses shown in the last row, where, as an example, five normalised pulses are overlaid for each speech tempo (see Fig. 2). For this reason, only individual glottal pulses are used in further experiments and not the whole glottal flow periods, since the time interval of the closed glottis seems to be less representative for characterizing the actual state of the speaker.

III. GLOTTAL FEATURE EXTRACTION
This section describes the method used for extracting the chosen parameters, or more precisely their ratios. Basically, the method exploits only glottal pulses, composed of the primary opening phase To and the return phase Tr (see Fig. 1). Each of the mined glottal pulses is normalised to the value 1 in both the time and the amplitude domain, leading to dimensionally uniform glottal pulses that keep their original shape. In these two-dimensionally normalised pulses, the primary opening and return phases are processed separately. Both phases are transferred into a relative time scale that reaches the zero level at the position of the current pulse's peak and the maximum (100 %) in both directions, i.e. at the start and at the end of the current phases.
The extraction method is based on the observation of both phases only for a selected relative division n. Figure 3 shows the main idea of the n-percentage glottal pulse processing of the particular primary opening phase To(n) and return phase Tr(n), leading to the following equation
$$\mathrm{RTO}(n) = \frac{T_r(n)}{T_o(n)},$$
where RTO is the Return-To-Opening phase ratio of the current n-percentage interval, which is always symmetric for both phases in the relative scale. Area, skewness and kurtosis (the latter two being the third and fourth standardized Pearson's moments) are further calculated for both n-percentage intervals. Finally, for each mined parameter value, the RTO phase ratio is calculated to indicate the level of domination of one n-percentage interval for the current parameter. Thus, each part of the pulse curve (thick line segment in Fig. 3) corresponding to the n-percentage division is characterized by three different RTOs (kurtosis, skewness and area). These feature values are further used for processing and observing differences between normal and stressed speech. As an example, the real values of the investigated RTOs are listed in Table I, where all values are based on the 5 % interval and averaged over all speakers.
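The extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each phase is resampled by linear interpolation onto a relative time axis of 1000 steps, and skewness and kurtosis are taken over the amplitude samples of the n-percentage segment, which the paper does not specify.

```python
import numpy as np

def phase_stats(phase, n_percent, resolution=1000):
    """Area, skewness and kurtosis of the first n % of one phase.

    `phase` holds amplitude-normalised samples ordered from the pulse
    peak (relative time 0 %) outwards to the phase boundary (100 %).
    """
    phase = np.asarray(phase, dtype=float)
    # Time normalisation: resample the phase onto a relative 0..1 axis.
    rel_t = np.linspace(0.0, 1.0, resolution + 1)
    curve = np.interp(rel_t, np.linspace(0.0, 1.0, len(phase)), phase)
    last = int(round(resolution * n_percent / 100.0))
    seg = curve[: last + 1]
    # Trapezoidal area under the segment in relative time units.
    area = np.sum((seg[1:] + seg[:-1]) / 2.0) / resolution
    centred = seg - seg.mean()
    std = seg.std()
    return {
        "area": area,
        "skewness": np.mean(centred ** 3) / std ** 3,  # third standardized moment
        "kurtosis": np.mean(centred ** 4) / std ** 4,  # fourth standardized moment
    }

def rto_features(pulse, n_percent):
    """Return-to-opening ratios of area, skewness and kurtosis for one pulse."""
    pulse = np.asarray(pulse, dtype=float)
    # Amplitude normalisation to the range 0..1.
    pulse = (pulse - pulse.min()) / (pulse.max() - pulse.min())
    peak = int(np.argmax(pulse))
    opening = pulse[: peak + 1][::-1]  # from the peak back to the pulse start
    ret = pulse[peak:]                 # from the peak forward to the pulse end
    stats_o = phase_stats(opening, n_percent)
    stats_r = phase_stats(ret, n_percent)
    return {name: stats_r[name] / stats_o[name] for name in stats_o}
```

For a perfectly symmetric pulse all three ratios equal 1, so deviations from 1 express how strongly one phase dominates within the chosen interval. A phase segment with zero skewness in the opening phase would make the skewness ratio undefined; the sketch does not guard against that case.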

IV. REAL STRESS DATABASE
The research presented in this paper has been performed on a purpose-built database containing speech under real psychological stress as well as normal speech. The first part of the database is formed by 18 different Czech-speaking male speakers from the ExamStress database [28], previously used for the observation of vowel polygon differences depending on the speaker's state [29]. The second part of the database is formed by another 6 Czech male speakers recorded with the PCB 378B02 microphone, suitable for infrasonic applications, and the USB-9234 sound interface produced by National Instruments. All Czech speakers in both parts of the database were recorded during their thesis defence as part of the final exam, in order to capture the influence of real psychological stress. A few days later, each speaker repeated the same text in more comfortable conditions to record the speaker's normal mood.

V. EXPERIMENTAL RESULTS
This section describes the results achieved in the performed experiments. In the fluent speech of the second part of the database (six speakers), Czech vowels were automatically detected and separated for further processing [30]. Then, the separated vowels were manually divided into beginning and centre vowel parts, from which glottal pulses were estimated by the DIF and IAIF methods. In the first part of the database, vowels were separated manually from fluent speech and then processed in the same way, in order to obtain the cleanest possible training data for the designed classifiers.
For naturally dynamic speech, the efficiency of emotional state (stress and normal mood) recognition was evaluated for two glottal flow estimation methods (DIF and IAIF) in the beginning and centre vowel parts for 20 different n-percentage intervals (5 % to 100 % in steps of 5 %). The same efficiency tests were also applied to speech normalised in each 10 ms segment, in order to observe the impact of dynamic range limitation on the uniformity of glottal pulses. The efficiency, or more precisely the uniformity of glottal pulses under normal and stress conditions, is tested by four different classifiers available in the standard MATLAB version, which were appropriately trained, validated and applied.
In the following text, the eight compared variants of glottal pulse recognition are referred to as Method 1 to Method 8.
The k-Nearest Neighbour (kNN) classifier was chosen as the first classifier. The best results are reached for the 5 % observed interval of glottal pulses, where the most significant differences between normal and stressed speech occur. An efficiency of almost 95 % is reached by Method 2 on the 5 % selected interval. Further, accuracy over 90 % is reached by Method 1 and Method 4 for the 5 % and 10 % selected intervals. The latter method is the most successful on higher n-percentage intervals, where its recognition efficiency lies between 70 % and 80 %. The worst efficiency of the kNN classifier was reached by Method 8, with efficiency values lower than 40 % for higher n-percentage intervals. In general, the average recognition efficiency of kNN is approximately 60 % over all methods and intervals.
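As an illustration of how a kNN classifier assigns a glottal pulse to the normal or stressed class, the following sketch applies a majority vote among the k nearest training feature vectors. The paper used the classifiers available in MATLAB; this Python function, its name, the Euclidean distance and the choice k = 3 are assumptions made only for the example, and the labels 0 (normal) and 1 (stressed) are a hypothetical encoding.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=3):
    """Classify feature vectors by majority vote of the k nearest neighbours.

    Labels are assumed to be 0 (normal) or 1 (stressed); k should be odd.
    """
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y)
    query_x = np.atleast_2d(np.asarray(query_x, dtype=float))
    # Pairwise Euclidean distances, shape (n_query, n_train).
    dist = np.linalg.norm(query_x[:, None, :] - train_x[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = train_y[nearest]  # labels of the k nearest neighbours
    # A query is labelled stressed when more than half of the votes are 1.
    return (votes.sum(axis=1) * 2 > k).astype(int)
```

Each row of `train_x` would hold the RTO values (for instance area, skewness and kurtosis ratios) of one glottal pulse for a fixed n-percentage interval.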
The efficiency of stress detection for the chosen classifier, n-percentage interval and method is calculated as follows
$$\mathrm{Efficiency} = \frac{N_{cdn} + N_{cds}}{N_n + N_s} \cdot 100, \qquad (3)$$
where Nn is the total number of used normal state glottal pulses, Ns is the total number of glottal pulses under psychological stress, Ncdn is the number of correctly detected normal mood glottal pulses and Ncds is the number of correctly classified stressed glottal pulses.
A significant increase in efficiency was obtained with the SVM classifier, where the average efficiency approaches 70 % over all methods and intervals. Generally, the efficiency reached by SVM can be regarded as more satisfactory, with more n-percentage glottal pulse intervals usable for correct psychological stress detection. The best results, approaching 95 % accuracy, are obtained by Method 4 for the 5 % interval as well as in the selected interval range 75 %-95 %. The most significant differences between the observed feature ratios of normal and stressed speech can be found in the 5 % (Method 6 and Method 7) and 65 % (Method 3) selected intervals, where the accuracy also approaches 95 %.
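The efficiency measure defined above, i.e. the percentage of correctly classified normal and stressed glottal pulses among all tested pulses, can be written as a short function. The function and argument names are hypothetical, and the numbers in the usage line are invented solely to show the arithmetic.

```python
def detection_efficiency(n_cdn, n_cds, n_n, n_s):
    """Percentage of correctly classified pulses.

    n_cdn: correctly detected normal pulses, n_cds: correctly detected
    stressed pulses, n_n: all normal pulses, n_s: all stressed pulses.
    """
    total = n_n + n_s
    if total == 0:
        raise ValueError("at least one glottal pulse is required")
    return 100.0 * (n_cdn + n_cds) / total

# Hypothetical counts: 45 of 50 normal and 50 of 50 stressed pulses correct.
print(detection_efficiency(45, 50, 50, 50))
```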
GMM was used as the third classifier. High efficiency values of psychological stress detection were reached over all possible n-percentage intervals of glottal pulses. On the other hand, the lowest accuracy values overall are also produced by GMM, namely for the 10 % (Method 5) and 35 % (Method 1) selected intervals, where the accuracy of stress detection approaches only 10 %. In some selected intervals, each method reaches an efficiency of almost 95 %, which indicates the highest uniformity of the observed features depending on the actual state of the speaker and points to GMM as a suitable classifier for stress detection. The average efficiency over all methods and intervals approaches 82 %, and the best results were achieved by Method 4.
Figure 4 illustrates the efficiency reached by Method 3 and its sound-normalised equivalent Method 4, showing the impact of sound normalisation on stress detection in the form of more stable efficiency results. The best and most constant results can be found in the range of n-percentage intervals 50 %-90 % for Method 4. Evidently, the chosen method and classifier are not as important for stress recognition as the appropriately selected interval, but generally the best results are reached by the GMM classifier.
The final ranking of the used types of stress detection parameters is listed in Table II, where, due to the total number of 640 used types, only the first fifteen (best) and the last five (worst) positions are listed. All types are written only as abbreviations, e.g. GMM_5_D_C_N represents the GMM classifier applied to the 5 % selected interval, the DIF estimation method, the vowels' centre and normalised sound (Method 6). The total numbers of analysed normal Nn and stressed Ns glottal pulses are listed in Table II.
In general, the best results are reached by Method 4 and the GMM classifier. The 5 % sector can be marked as the best n-percentage interval, but the range in which stress recognition is most stable is from 50 % to 80 %.
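The total of 640 types follows from combining 4 classifiers, 20 n-percentage intervals, 2 estimation methods, 2 vowel parts and 2 sound variants. The sketch below enumerates these combinations in the abbreviation style of GMM_5_D_C_N. Only the codes D (DIF), C (centre) and N (normalised) appear in the text; the codes I, B and O used here for IAIF, vowel beginning and original (non-normalised) sound are hypothetical placeholders.

```python
from itertools import product

CLASSIFIERS = ["kNN", "SVM", "GMM", "PNN"]
INTERVALS = range(5, 105, 5)   # 5 % to 100 % in steps of 5 %
ESTIMATORS = ["D", "I"]        # D = DIF (from the text), I = IAIF (assumed code)
VOWEL_PARTS = ["C", "B"]       # C = centre (from the text), B = beginning (assumed code)
SOUND = ["N", "O"]             # N = normalised (from the text), O = original (assumed code)

def all_types():
    """List every classifier/interval/method combination as an abbreviation."""
    return ["_".join([clf, str(n), est, part, snd])
            for clf, n, est, part, snd
            in product(CLASSIFIERS, INTERVALS, ESTIMATORS, VOWEL_PARTS, SOUND)]

types = all_types()
print(len(types))  # 4 * 20 * 2 * 2 * 2 combinations
```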

VI. CONCLUSIONS
Based on the comparison of eight different glottal pulse estimation methods and four classifiers, the GMM classifier can be marked as the best for stress recognition, in combination with the method estimating glottal pulses by the IAIF algorithm from the beginning of vowels in normalised sound. The presented RTOs show that:
- the IAIF estimation is more suitable than the DIF algorithm,
- the influence of stress is better detectable at the vowels' beginning,
- sound normalisation leads to more stable efficiency results,
- the biggest differences in RTOs between normal and stressed speech lie in the 5 % interval as well as in the 65 % interval.
Generally, the presented approach corresponds with a similar method detecting stress by means of glottal pulse distribution [31]. However, the presented experiments show higher accuracy (95 %) than the accuracy of 88 % published in [31] or the results in [32], where the Glottal Spectral Slope reached stress detection ratios in the range 18 %-36 %.
The combination of automatic vowel detection, e.g. [30], and the findings presented in this paper can lead to the development of new systems recognizing psychological stress in speech, a state which can negatively influence human behaviour. These systems can be practically applied in many fields, e.g. machine control, medical applications, etc.
Further, it is necessary to expand the real psychological stress database to verify the presented results experimentally. In future, the described method will also be expanded and adapted for use on all estimated glottal pulses in all voiced parts of speech, i.e. not only on detected vowels. This modification can lead to a higher number of estimated glottal pulses and to an observation of whether the described methods are phoneme-independent in psychological stress detection in speech.

Fig. 2. Differences of glottal pulses depending on the speech tempo. Black indicates the fast speech, i.e. the shorter version of the same spoken word.

Fig. 3. Division of a two-dimensionally normalised glottal pulse into particular n-percentage intervals of the opening and return phase.

Fig. 4. Efficiency of stress detection for Method 3 (dashed grey line) and Method 4 (solid black line) depending on the selected n-percentage interval when using the GMM classifier.

The Probabilistic Neural Network (PNN) was used as the fourth classifier. Compared with the previous results, similar observations were made. The highest uniformity of RTOs depending on the speaker's state can be found for the 5 % selected interval (Method 2, Method 4 and Method 7) and for the higher intervals 75 %-100 %, the latter only with Method 4. The absolutely highest accuracy in stress detection (almost 94 %) is achieved by Method 2. The worst efficiency results were achieved in intervals higher than 70 % by Method 3 and Method 8. Neither of these methods is suitable for psychological stress detection with the PNN classifier. Generally, the average efficiency of the PNN classifier using RTOs is approximately 62 %.

TABLE I. AVERAGED REAL VALUES OF THREE RETURN-TO-OPENING PHASE RATIOS IN 5 % SELECTED INTERVAL, IAIF METHOD, NORMALISED SOUND VOWEL'S BEGINNING.

Table II also lists the falsely detected normal N'cdn and stressed N'cds glottal pulses.

TABLE II. FINAL SORTING OF USED TYPES.