Multi-resolution Feature Extraction Algorithm in Emotional Speech Recognition
In this paper, a new approach to recognizing emotional speech from audio recordings is presented. To obtain the optimal processing window width for feature extraction and to achieve the highest recognition rates, a trade-off between time and frequency resolution must be made. We therefore define a new procedure that combines the advantages of narrower and wider windows by dynamically adjusting the time and frequency resolution of individual features. To raise recognition rates further, two major procedures are added to the multi-resolution feature-extraction concept: the exclusion of features calculated at different processing window widths, and the use of only those parts of the recordings in which the emotions are most explicit. To confirm the benefits of the algorithm, audio recordings from the Interface emotional speech database, together with four different classifiers, were used in the evaluation. The highest emotion recognition rate achieved with the multi-resolution approach exceeded that of the best single-resolution approach by 3.5 %, with an average improvement of 1.5 % in absolute terms.
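The core idea of multi-resolution feature extraction can be illustrated with a minimal sketch: the same frame-level features are computed at several processing window widths, and the per-resolution utterance-level statistics are concatenated into one feature vector. The sketch below is an assumption-laden illustration, not the paper's exact algorithm; the choice of features (short-time energy and zero-crossing rate), the window widths (25 ms and 100 ms), and the 50 % hop are all hypothetical placeholders for whatever feature set and resolutions a concrete system would use.

```python
import math

def frame_features(signal, sr, win_ms, hop_ms):
    """Compute short-time energy and zero-crossing rate per frame.

    Illustrative stand-ins for any frame-level acoustic features.
    """
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        energy = sum(x * x for x in frame) / win
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / win
        feats.append((energy, zcr))
    return feats

def stats(values):
    """Utterance-level aggregation: mean and standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def multires_vector(signal, sr, win_widths_ms=(25, 100)):
    """Concatenate per-resolution statistics into one feature vector.

    A narrow window gives fine time resolution, a wide one fine
    frequency/long-term resolution; both views are kept.
    """
    vec = []
    for win_ms in win_widths_ms:
        frames = frame_features(signal, sr, win_ms, hop_ms=win_ms // 2)
        for feature_track in zip(*frames):   # one track per feature type
            vec.extend(stats(feature_track))
    return vec

# Synthetic 1-second "utterance": a 200 Hz tone with rising amplitude.
sr = 8000
signal = [(0.2 + 0.8 * t / sr) * math.sin(2 * math.pi * 200 * t / sr)
          for t in range(sr)]
vec = multires_vector(signal, sr)
print(len(vec))  # 2 resolutions x 2 features x 2 stats = 8
```

In a full system, the resulting multi-resolution vector would then be passed to a classifier, with feature selection deciding which window widths each feature is kept at.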