Analysis of Closing-to-Opening Phase Ratio in Top-to-Bottom Glottal Pulse Segmentation for Psychological Stress Detection

This paper investigates the differences between glottal pulses estimated by two algorithms, Direct Inverse Filtering (DIF) and Iterative and Adaptive Inverse Filtering (IAIF), for normal and stressed speech. Individual glottal pulses are mined from the recorded speech signal and then normalized in two dimensions. Each normalized pulse is divided into a closing and an opening phase and further segmented into n-percentage sectors in the Top-To-Bottom (TTB) amplitude domain. Three parameters, kurtosis, skewness and pulse area, as well as their Closing-To-Opening phase ratios, are analysed. The designed GMM classifier is trained on speakers from the Czech ExamStress database and then applied to another part of the ExamStress database and to the English SUSAS database, in order to investigate whether the presented approach is independent of spoken language and speech signal quality. The results achieved by DIF indicate independence of language and recording quality (contrary to methods using IAIF). The best n-percentage sectors in the TTB segments lie between 5 % and 40 %. In this range, methods based on DIF reached a psychological stress recognition efficiency of 88.5 % on average, whereas the average stress detection efficiency of methods based on IAIF was 73.3 %. DOI: http://dx.doi.org/10.5755/j01.eie.22.5.16348


I. INTRODUCTION
The current trend is to monitor the actual emotional state of a speaker by non-invasive methods such as remote analysis of the speech signal, mostly for employees in risk professions (e.g. pilots, rescuers, etc.), to avoid dangerous or unpleasant situations. Psychological stress can be classified as an emotion; the psychological state thus influences human behaviour and self-confidence. For this reason, it is appropriate to recognize the stress of a speaker immediately, especially in situations when the speaker's behaviour is negatively influenced by distress.
Many methods of stress detection exist and are based mostly on directly mined speech features like MFCC [1], pitch [2], formants [3], etc. Other publications present methods using a set of chosen features, e.g. TEO energy, spectral centroid, pitch range, etc. [4], or psychological stress recognition based on plane shapes, so-called vowel polygons, created from relevant formant values [5]. The acoustic-phonetic differences between stressed and normal speech are described in [6], where a rise in the speaker's pitch is mentioned as the most obvious audible change in stressed speech. Psychological stress recognition is not frequently studied, but other methods, features and databases can be partly found in surveys oriented on emotional speech recognition [7], [8], or in a survey describing research methods and further steps specifically for speech under stress [9].
Generally, methods of stress recognition by glottal pulse analysis are less frequently applied, which opens possibilities for novel observations in this field. For example, glottal and other features were used for emotion recognition by Iliev et al. with an optimum-path forest [10], and by Muthusamy et al. [11]. According to [12], vocal tract cavities are affected by psychological stress, which can be detected from LPCs. Glottal analysis complemented by other speech features was published in [13], where accuracy increased from approximately 75 % to 92 % after adding the glottal feature. Another study [14], based only on statistical analysis of glottal pulses fixed at their maxima and overlaid, reports a psychological stress recognition rate of 88 %. Thus, techniques of emotion recognition based only on the analysis of glottal features have rarely been published.
Compared to previously published methods of psychological stress recognition, this paper describes an innovative method based only on glottal pulse analysis in the amplitude domain. Specifically, the main novelty of this work is the analysis of mined glottal pulses as a two-dimensional shape or, e.g., a probability distribution. The fundamental idea of this paper is also based on the assumption that glottal flow is independent of the spoken phonemes. This motivates stress recognition experiments on real stress databases containing different languages, to test whether glottal flow analysis, and in particular the presented methods, can be successfully applied to psychological stress recognition independently of language and phonetic content.

II. MATERIAL AND METHODS
This section describes the introductory steps and options necessary for observing the differences between glottal pulses in stressed and normal speech. The applied methods and the glottal pulse estimation are based on the software tool Aparat [15].

A. Used Methods
Many ways of estimating glottal flow exist; algorithms based on inverse filtering techniques aim at a reliable estimate of the glottal source wave. In other words, inverse filtering techniques remove the influence of the vocal tract directly from the speech signal. Because the goal is to obtain the most realistic glottal flow, estimation techniques based on inverse filtering are used in our experiments. The glottal pulses were estimated from the speech signal by two common algorithms [16], Direct Inverse Filtering (DIF) and Iterative and Adaptive Inverse Filtering (IAIF), both applied to originally captured and to normalized (NS) records at the vowels' beginning (VB) and centre part (CP). Combining two estimation algorithms, two record variants and two vowel parts gives eight different methods of analysing and estimating glottal pulses. As used in Section III, Methods 1 and 2 are based on DIF at the vowels' beginning, Methods 3 and 4 on IAIF at the vowels' beginning, Methods 5 and 6 on DIF at the vowels' centre part, and Methods 7 and 8 on IAIF at the vowels' centre part. Given the obvious differences between the used methods, the final comparison should uncover their suitability and efficiency for detecting psychological stress.
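As a rough illustration of the inverse filtering idea described above, the following Python sketch estimates an all-pole vocal tract model by linear prediction (autocorrelation method with the Levinson-Durbin recursion), removes it from the signal with the corresponding FIR inverse filter, and applies a leaky integrator to compensate for lip radiation. It is a generic textbook-style sketch, not the DIF or IAIF implementation of Aparat; the function names, the model order of 10 and the integrator coefficient 0.99 are assumptions made for illustration only.

```python
def autocorrelation(x, max_lag):
    """Raw (unnormalized) autocorrelation r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns A(z) coefficients [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # signal already fully predicted; leave the rest at zero
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient of stage i
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def inverse_filter(x, a):
    """Apply the FIR inverse filter A(z) to x, removing the all-pole envelope."""
    return [
        sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
        for n in range(len(x))
    ]


def leaky_integrate(x, rho=0.99):
    """First-order leaky integrator y[n] = x[n] + rho * y[n-1] (assumed lip-radiation compensation)."""
    out, prev = [], 0.0
    for value in x:
        prev = value + rho * prev
        out.append(prev)
    return out


def estimate_glottal_flow(frame, order=10, rho=0.99):
    """Hypothetical helper: LPC analysis, inverse filtering, then integration."""
    r = autocorrelation(frame, order)
    a = levinson_durbin(r, order)
    residual = inverse_filter(frame, a)
    return leaky_integrate(residual, rho)
```

Calling `estimate_glottal_flow(frame)` on one voiced frame returns a sequence of the same length as the frame; in practice the model order would be tuned to the sampling rate.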

B. Used Database
Two different databases were used in the presented experiments. Firstly, 12 male native Czech speakers were randomly selected from the previously created ExamStress database [17], in which the same speech is recorded for each speaker during the final oral exams (under stress) and a few days later (normal state). For this reason, the differences between the normal state and real psychological stress can be observed.
Secondly, the SUSAS database [18] was used to validate the psychological stress detection efficiency on the English language and on low-quality records containing high noise levels, voice distortion and signal clipping. Specifically, the part containing real psychological stress, recorded from 2 Apache pilots almost out of fuel, was used in the presented experiments.

C. Used Glottal Pulse Features
Differences between stressed and normal speech were observed and further classified using a vector of three glottal pulse features. In the first step, each mined glottal pulse is normalized in amplitude and length to a maximum value of 1, to bring the global pulse size into accord. Then, each normalized glottal pulse is divided into a series of pulse segments from the peak down to an n-percentage amplitude level, which is shifted step by step along the amplitude axis. The 0 % level is at the top of the glottal pulse, and the 100 % level lies at its bottom (see Fig. 1). For this reason, the used glottal pulse segmentation is called Top-To-Bottom (TTB).
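The normalization and TTB segmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the pulse is given as a list of samples, the opening phase is taken as the part before the peak and the closing phase as the part after it, and the segment is the contiguous run of samples around the peak whose normalized amplitude is at or above 1 - n/100. The function names are hypothetical and the boundary handling in the original implementation may differ.

```python
def normalize_pulse(pulse):
    """Scale amplitude to [0, 1] and map the sample positions to time in [0, 1]."""
    lo, hi = min(pulse), max(pulse)
    if hi == lo:
        raise ValueError("a flat pulse cannot be normalized")
    amps = [(v - lo) / (hi - lo) for v in pulse]
    last = len(pulse) - 1
    times = [i / last for i in range(len(pulse))]
    return times, amps


def ttb_segment(amps, n_percent):
    """Return ((open_start, peak), (peak, close_end)) index pairs of the TTB segment.

    The 0 % level is the pulse top and the 100 % level is its bottom, so the
    amplitude threshold is 1 - n_percent / 100 on the normalized pulse.
    """
    level = 1.0 - n_percent / 100.0
    peak = amps.index(max(amps))
    left = peak
    while left > 0 and amps[left - 1] >= level:
        left -= 1  # extend the opening phase towards the pulse start
    right = peak
    while right < len(amps) - 1 and amps[right + 1] >= level:
        right += 1  # extend the closing phase towards the pulse end
    return (left, peak), (peak, right)


# Usage on a toy pulse (values are illustrative, not measured data)
times, amps = normalize_pulse([0.0, 1.0, 2.0, 1.5, 1.0, 0.5, 0.0])
opening, closing = ttb_segment(amps, 50)
opening_samples = amps[opening[0]:opening[1] + 1]
closing_samples = amps[closing[0]:closing[1] + 1]
```

Here the peak sample is shared by both phases, which is one possible convention.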
The selected n-percentage pulse segment is further analysed in the time domain, taking into consideration the opening and closing phases. Kurtosis α, skewness β and area γ are calculated for each wave part corresponding to To_p and Tc_p. Then, the Closing-To-Opening phase ratio (CTO) is

$$\mathrm{CTO}_p(n) = \frac{Tc_p(n)}{To_p(n)},$$

where n is the actual n-percentage level, p stands for one of the analysed parameters (i.e. skewness, kurtosis or area), Tc_p is the current closing phase value and To_p is the current opening phase value. Figure 2 shows the most illustrative example of glottal pulses at the beginning of /u/ vowels estimated by the DIF algorithm for both states of one speaker. Here, the given CTO values were averaged over 39 pulses. A similar CTO was applied in previous research oriented on percentage segmentation of glottal pulses along the time axis [19]. In that case, Gaussian Mixture Models (GMM) were evaluated as the most appropriate feature processing approach among six different classifiers.
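A minimal sketch of the CTO computation for the three parameters is given below. It assumes that kurtosis and skewness are the standard moment-based statistics of the amplitude samples of each phase and that the area is obtained by the trapezoidal rule; the text does not specify these details, so both choices, as well as all function names, are assumptions.

```python
def trapezoid_area(samples, dt=1.0):
    """Area under the sampled curve by the trapezoidal rule."""
    return sum((samples[i] + samples[i + 1]) * dt / 2.0 for i in range(len(samples) - 1))


def central_moment(samples, k):
    """k-th central moment of the samples (population form)."""
    mean = sum(samples) / len(samples)
    return sum((v - mean) ** k for v in samples) / len(samples)


def skewness(samples):
    """Moment-based skewness m3 / m2^(3/2)."""
    return central_moment(samples, 3) / central_moment(samples, 2) ** 1.5


def kurtosis(samples):
    """Moment-based (non-excess) kurtosis m4 / m2^2."""
    return central_moment(samples, 4) / central_moment(samples, 2) ** 2


def cto(closing, opening, feature):
    """Closing-To-Opening ratio of one feature at one n-percentage level."""
    denominator = feature(opening)
    if denominator == 0:
        raise ZeroDivisionError("opening-phase feature is zero; CTO is undefined")
    return feature(closing) / denominator


# Usage on toy phases (illustrative values only)
opening = [0.5, 1.0]
closing = [1.0, 0.75, 0.5]
cto_area = cto(closing, opening, trapezoid_area)
```

Skewness and kurtosis need enough samples with non-zero variance in each phase; very short segments near the 0 % level may therefore have to be skipped.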

III. EXPERIMENTAL RESULTS
This section describes the performed experiments and the achieved results. In the training process, 5 Czech vowels, each spoken separately a few times, were automatically located [20] in the recorded speech of 6 speakers from the ExamStress database to obtain reference CTO values at the vowels' beginning and centre part by averaging for all vowels. This means that 3 reference CTO values are stored for each speaker and each vowel, giving 90 reference values in total (6 speakers × 5 vowels × 3 values) for each speaker state and each used method. These reference values are then used for training the GMM classifier. In the presented experiments, the GMM classifier embedded as standard in the Matlab environment is used and fitted to the reference values as a two-component Gaussian Mixture Model. The binary decision (stress/no stress) is then based on which state of the fitted GMM reaches the higher probability for the investigated data. The flow chart of the algorithm used in our experiments is illustrated in Fig. 3.

Fig. 2. An example of pulse differences depending on the speaker's state in glottal pulses estimated by DIF at the beginning of /u/ vowels for speaker 1 from the ExamStress database, for the 30 % interval, with average CTOs.

In our experiments, the investigated data, i.e. the observed glottal pulses, were automatically extracted from fluent speech, i.e. over all spoken phonemes, using a sliding rectangular window of 300 ms duration with 50 % overlap. Then, the estimated glottal pulses were normalized in two dimensions and filtered to remove parasitic pulses.
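The sliding-window extraction described above can be sketched as below. The sketch only computes the sample ranges of the rectangular windows; the function name and the handling of a trailing partial window (it is dropped) are assumptions.

```python
def frame_bounds(num_samples, sample_rate, window_s=0.3, overlap=0.5):
    """Return (start, end) sample indices of rectangular analysis windows.

    window_s is the window duration in seconds and overlap is the fraction
    shared by neighbouring windows (0.5 means 50 % overlap).
    """
    frame_len = int(round(window_s * sample_rate))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    if frame_len <= 0 or num_samples < frame_len:
        return []  # signal shorter than one window
    return [(s, s + frame_len) for s in range(0, num_samples - frame_len + 1, hop)]


# Usage: one second of signal at an assumed sampling rate of 1000 Hz
bounds = frame_bounds(1000, 1000)
```

With these toy settings the windows start every 150 samples and each covers 300 samples; a rectangular window means the samples are used without tapering.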
The investigated speech data were obtained automatically for all voiced parts of speech. The second group of 6 ExamStress speakers was used for testing the designed classifier. Owing to the high quality and length of the records, approximately 1500 glottal pulses were analysed and classified for each speaker.
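To make the classification step more concrete, the following sketch fits a one-dimensional two-component Gaussian mixture by expectation-maximization and decides between the two states by comparing log-likelihoods. It is only one possible reading of the procedure: it assumes a separate mixture per speaker state and scalar CTO values, and it is a plain-Python stand-in, not the Matlab classifier used in the experiments. All names and the numeric values in the usage example are hypothetical.

```python
import math


def gauss_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm2(data, iters=50, min_var=1e-6):
    """Fit a two-component 1-D GMM by EM; returns (weights, means, variances)."""
    mean = sum(data) / len(data)
    var0 = max(sum((x - mean) ** 2 for x in data) / len(data), min_var)
    weights, mus, variances = [0.5, 0.5], [min(data), max(data)], [var0, var0]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        resp = []
        for x in data:
            p = [weights[k] * gauss_pdf(x, mus[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # empty component; keep its previous parameters
            weights[k] = nk / len(data)
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk, min_var
            )
    return weights, mus, variances


def log_likelihood(data, model):
    """Total log-likelihood of the data under a fitted mixture."""
    weights, mus, variances = model
    total = 0.0
    for x in data:
        p = sum(w * gauss_pdf(x, m, v) for w, m, v in zip(weights, mus, variances))
        total += math.log(max(p, 1e-300))
    return total


def classify(observed, normal_model, stress_model):
    """Binary decision: the state whose model gives the higher likelihood."""
    if log_likelihood(observed, stress_model) > log_likelihood(observed, normal_model):
        return "stress"
    return "normal"


# Usage with made-up reference CTO values (not taken from the paper)
normal_model = fit_gmm2([0.9, 1.0, 1.1, 1.0, 0.95, 1.05])
stress_model = fit_gmm2([1.9, 2.0, 2.1, 2.0, 1.95, 2.05])
decision = classify([2.0, 1.98], normal_model, stress_model)
```

The variance floor `min_var` is a guard against a component collapsing onto a single sample, which can otherwise happen with small reference sets.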
To investigate the language independence of the presented methods, the SUSAS database was used. Compared to the ExamStress database, the low quality of these records (e.g. voice distortion, clipping, loud background noise, etc.) sharply decreases the total number of estimated glottal pulses. It was observed experimentally that processing only short parts (50 ms) of SUSAS records leads to satisfactory glottal pulse mining. Other lengths of the analysed speech signal lead to estimated glottal pulses which do not match the Liljencrants-Fant model [21]. All mined glottal pulses were filtered automatically, because incorrectly estimated glottal pulses occurred even for short analysed signals. For each speaker in SUSAS, approximately 130 glottal pulses were estimated correctly and further used for psychological stress detection, irrespective of sound normalization and the vowels' parts.
The efficiency results reached using Method 1 and Method 2 are listed in Table I, where a few facts are evident. Sound normalization causes a decrease in psychological stress detection efficiency on the ExamStress database, but the achieved efficiency is generally high and more than satisfactory. In contrast, the results achieved for the SUSAS database are high and more or less constant over all chosen n-percentage intervals for both methods, which leads to a much higher efficiency than that achieved using Method 2. These observations suggest that low-quality records are less sensitive to sound normalization of the testing sequences.
Table II shows the efficiency obtained by psychological stress detection based on the IAIF estimation algorithm and vowels' beginning (Method 3 and Method 4).
Comparing the results reached on the ExamStress database, the negative influence of sound normalization appears as a significant decrease of efficiency over all the observed n-percentage intervals. This effect is less evident for the SUSAS database, where almost all efficiencies are lower than their ExamStress equivalents (except the 80 % and 100 % levels for Method 3 as well as the 15 % and 55 % levels for Method 4, which achieve a stress detection efficiency of 95 %).
Thus, psychological stress detection based on the IAIF estimation algorithm applied to the vowels' beginning is not appropriate for low-quality records.
Further, the recognition efficiency was calculated for methods based on the vowels' centre part.
For DIF-based methods (Method 5 and Method 6) and the ExamStress database, efficiency is more or less similar (over 90 %). However, for n-percentage intervals higher than 50 %, efficiency slightly decreases to a value of 77 %. Applying Method 5 and Method 6 to the SUSAS database yields efficiency similar to that for the ExamStress database, with high values over almost all n-percentage intervals. Some exceptions can be found in the 45 %, 65 % and 80 % intervals, where both methods obtained poor and unsatisfactory efficiency. According to these observations, the DIF glottal pulse estimation algorithm has also been found appropriate for psychological stress detection. The effect of sound normalization on stress recognition can be classified as minimal, as can the effect of low-quality records and of the spoken language. Psychological stress detection efficiency results obtained by the IAIF estimation based on the vowels' centre part are described in the following text and are equivalent to the previously mentioned Table II.
As in the previous cases (see Table I and Table II) for the ExamStress database, the IAIF estimation algorithm generally reaches lower recognition efficiency than the DIF algorithm, but it still gives satisfactory results on almost all n-percentage intervals. When Method 7 and Method 8 are applied to the SUSAS database, the recognition efficiency generally decreases sharply, by 20 % on average, except for 4 n-percentage intervals where it reaches much higher values than for the ExamStress database.
As in the previous case, the IAIF algorithm is sensitive to the quality of the analysed records, and in some cases also to the spoken language. Methods based on the IAIF algorithm applied to the vowels' beginning are more suitable for psychological stress detection than Method 7 and Method 8.

IV. EFFICIENCY EVALUATION
To summarize the results listed in the previous section, a final evaluation of the used methods and n-percentage intervals is necessary. Firstly, all investigated n-percentage intervals are evaluated to find the consecutive glottal pulse parts where the largest differences between normal and stressed speech occur. The band of the best n-percentage TTB amplitude intervals lies between 5 % and 40 %, where the average efficiency ε consistently exceeds 77.5 %. Table IV lists the average efficiency ε for each used method, over both databases and all n-percentage intervals in the range from 5 % to 40 % with a step of 5 %. As can be seen, the efficiency depends on the used method, specifically on the vowel part used for training the classifier. The results listed in Table IV show that sound normalization has no significant positive impact on ε over the n-percentage intervals. Similar ε results are achieved by methods that use the same glottal pulse estimation algorithm and differ only in the vowel part used for training. Finally, the highest average efficiency on the observed n-percentage intervals is reached by the DIF estimation method (88.5 %), which exceeds the IAIF estimation algorithm (73.3 %) by 15.2 percentage points.
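The averaging over the 5 % to 40 % band can be sketched as follows. The helper name is hypothetical and the dictionary in the usage example contains made-up placeholder numbers, not values from the tables; only the final check uses the two averages reported in the text.

```python
def band_average(efficiency_by_level, low=5, high=40):
    """Mean efficiency over the n-percentage levels within [low, high]."""
    selected = [v for level, v in efficiency_by_level.items() if low <= level <= high]
    if not selected:
        raise ValueError("no n-percentage level falls inside the requested band")
    return sum(selected) / len(selected)


# Placeholder efficiencies in percent, keyed by n-percentage level (illustrative only)
toy_efficiency = {5: 80.0, 10: 90.0, 40: 100.0, 45: 10.0}
toy_average = band_average(toy_efficiency)  # the 45 % level lies outside the band

# Difference between the reported DIF and IAIF averages, in percentage points
reported_gap = round(88.5 - 73.3, 1)
```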

V. CONCLUSIONS
According to all achieved results, it can be concluded that the DIF-based methods give better stress detection, glottal pulse normalization is sensitive to the sound quality, and the vowel part used for classifier training does not have a significant effect on recognition efficiency. Using the presented algorithms of glottal pulse processing with pulses estimated by DIF and TTB n-percentage intervals from 5 % to 40 % can lead to highly efficient psychological stress recognition in speech. Given the high recognition efficiency achieved (in some cases approaching 95 %), the presented technique could be classified as possibly text and language independent, which motivates a more detailed analysis of glottal flow with a view to deploying it in real applications.
Nevertheless, in future work, it is necessary to verify the achieved results on other languages and expand the speaker database.

Fig. 1. An example of glottal pulse n-percentage division where the selected closing phase is illustrated in dark grey and the opening phase in light grey. The thick line marks the chosen curve part of the glottal pulse.

Fig. 3. The flow chart of the used psychological stress recognition algorithm.
Manuscript received 30 November, 2015; accepted 3 May, 2016. Research described in this paper was financed by the Czech Ministry of Education in the frame of the National Sustainability Program under grant LO1401. For the research, the infrastructure of the SIX Center was used.

TABLE I. THE EFFICIENCY OF PSYCHOLOGICAL STRESS DETECTION REACHED BY METHOD 1 AND METHOD 2.

TABLE II
Table III lists average efficiency values ε reached for each n-percentage interval for all used methods and databases.