Extraction of Novel Features Based on Histograms of MFCCs Used in Emotion Classification from Generated Original Speech Dataset

Abstract: This paper introduces two contributions: a new feature based on histograms of MFCCs (Mel-Frequency Cepstral Coefficients) extracted from audio files, which can be used in emotion classification from speech signals, and a new multi-lingual, multi-speaker speech database covering three emotions. In this study, the Berlin Database (BD, in German) and our custom PAU database (in English), created from YouTube videos and popular TV shows, are employed for training and evaluation. Experimental results show that the proposed features lead to better classification results than the current state-of-the-art approaches from the literature that use a Support Vector Machine (SVM). With this novel feature, the study outperforms a number of studies based on MFCC features and an SVM classifier, including recent research. Because no comparable approaches based on our novel feature exist, one of the most common MFCC and SVM frameworks is implemented, and one of the most common databases, the Berlin DB, is used to compare our novel approach with approaches of this kind.


I. INTRODUCTION
Human-computer interaction systems have been drawing increasing attention in recent years. Understanding human emotions plays a significant role in these systems, since human feelings provide a better understanding of human behaviour. Furthermore, in order to increase the accuracy of recognizing the words spoken by humans, many state-of-the-art automatic speech recognition systems are dedicated to natural language understanding. Emotion classification has a key role in performance improvements for natural language understanding. Other areas in which an emotion classification system can be used include voice search tagging, word search with specific emotions, and emotion-based advertisement placement [1].
In this study, MFCCs are calculated for all audio files in both of the utilized databases. Then, these are classified based on the type of emotion. In [2], Plutchik claims that emotions are categorized as Primary Emotions and Secondary Emotions. The primary emotions are anger, fear, sadness, disgust, surprise, anticipation, trust, and joy. In this study, the emotions of sadness, happiness, and neutrality can be recognized by our system. We focus only on these three emotions because the amount of training data for the remaining ones is generally not large enough to arrive at statistically robust conclusions. There are two main contributions in this study. One is our novel feature, a representation of MFCCs based on their histograms, and the other is the PAU speech data, whose emotions are labelled and cross-checked by PhD students.
Section II covers academic studies related to this paper. In Section III, the experimental framework and its steps are elaborated. Section IV describes our novel feature and the classical MFCC features from the literature in detail. Section V describes the speech data and their characteristics. Finally, Section VI presents the experimental results and Section VII draws conclusions.

II. LITERATURE SURVEY
Various types of classifiers have been used for the task of speech emotion classification: Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Artificial Neural Networks (ANN), k-Nearest Neighbors (k-NN), and many others. In fact, there has been no agreement on which classifier is the most suitable one for emotion classification. It also seems that each classifier has its own advantages and limitations.
Many recent studies show that DNN-based approaches outperform SVM in many areas, such as image, speech, and text studies, given abundant data [3]. In recent papers [4]-[6], these two R&D groups independently have established closely related DNN architectures with multi-  [7], [8]. Therefore, we cannot implement DNN due to the limited data.
In [9], the authors leverage MFCC for feature extraction and multiple Support Vector Machines (SVM) as classifiers. Their extensive experiments are based on a speech database with happiness, anger, sadness, disgust, surprise, and neutral emotions. Performance analysis of multiple SVMs reveals that a non-linear kernel SVM achieves greater accuracy than a linear SVM [10]. As the authors mention, their best performance on Berlin DB is 75 % accuracy.
Dahake et al. [11] make two main contributions: one is feature extraction using pitch, formants, and MFCC, and the other is improving speaker-dependent SER by comparing the results with different kernels of the SVM classifier [12]. The highest accuracy is obtained with the feature combination MFCC + Pitch + Energy on both the Malayalam emotional database (95.83 %) and the Berlin emotional database (75 %), tested using SVM with a linear kernel.
In [13], three emotional states are recognized: happiness, sadness, and neutral. Explored features include energy, pitch, Linear Prediction Cepstral Coefficients (LPCC), MFCC, and Mel-Energy spectrum Dynamic Coefficients (MEDC). The Berlin Database and self-built Chinese emotional databases are used for training the specified classifiers.
In [14], basic emotions are recognised by comparing speech features. The authors use a methodology similar to the one in this paper to recognize emotions. However, their database and features for recognition are quite different from ours.
In order to combine the merits of several classifiers, aggregating a group of them has also been recently employed [15], [16]. Based on several studies [17]-[22], we can conclude that SVM is one of the most popular classifiers in emotion classification, probably because it had been widely used in almost all speech applications up to 2012. As shown in Table I [23], the average success rate of SVM for speech emotion classification is in the range of 75.45-81.29 %.
In [24], Kamruzzaman and Karim report on speaker identification for authentication and verification in security areas. This kind of identification is mainly divided into text-dependent and text-independent approaches. Even though many studies utilize the text-dependent approach based on a variety of predefined utterances, this study employs a text-independent methodology. Basically, the implementation part of this study is composed of feature generation and classification. MFCC coefficients are calculated as a foundation of our informative features, and SVM utilizes these features in order to classify the speech data.
In [25], Demircan and Kahramanli extract MFCCs from the speech data obtained from the Berlin Database [26] (Berlin Database of Emotional Speech, 2014). Seven statistical values are calculated from the MFCCs: minimum value, maximum value, mean, variance, median, skewness, and kurtosis. Using those values, the k-Nearest Neighbor algorithm is used to classify the data. Their contribution is to reduce the dimension of the data to 7 different values.

III. EXPERIMENTAL FRAMEWORK
In order to carry out various experiments that show the performance of our novel emotion classification feature, we describe a framework in detail. The steps of this emotion classification framework (Fig. 1) are as follows, in sequence.

A. Collect Speech Data
Collecting speech data plays a significant role in speech recognition studies due to the lack of comprehensive speech data. Therefore, speech data collection constitutes a major part of this study. The details of the data properties and how the data are generated are explained in Section V-C.

B. Preprocessing
Because noise degrades speech data, removing outliers plays a significant role in a state-of-the-art classification system. To filter them out, the interquartile range method of John Tukey [27] is employed. Furthermore, min-max normalization is applied feature-wise to reduce the sensitivity to features with high variance.
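The two preprocessing steps can be illustrated with a minimal sketch. It is not the implementation used in this study: the fence multiplier k = 1.5 (Tukey's conventional choice), the quartile interpolation method, and the function names are assumptions made for illustration, and the paper does not state whether filtering is applied per feature or per sample.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return Tukey's lower and upper fences (k = 1.5 is an assumed default)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the interquartile fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


def min_max_normalize(values):
    """Rescale one feature to [0, 1]; a constant feature maps to all zeros."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]
```

For example, `remove_outliers([1, 2, 3, 4, 5, 100])` drops the value 100, and normalizing the remaining values spreads them evenly over [0, 1].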

C. Feature Extraction
The extraction of suitable features that efficiently represent different emotions is one of the most important issues in the design of a speech emotion classification system. A proper group of features significantly affects the classification results, since pattern recognition techniques are rarely independent of the problem domain. In this study, MFCCs are selected as a group of features. More specifically, for the first feature, the average of the MFCCs and the first and second derivatives of the average MFCCs are calculated. The second feature, which is our novelty, consists of weighted MFCC values that combine the MFCC values with the values of their corresponding Probability Density Function (PDF). For the third feature, the concatenation of the first and second features is leveraged to get higher performance.

Mel-Frequency Cepstrum Coefficients (MFCC)
MFCCs are calculated based on the known variation of the human ear's critical bandwidths with frequency. The main point in understanding speech is that the sounds generated by a human are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out. If the shape is accurately determined, this should result in an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the purpose of MFCCs is to represent this envelope accurately [1].
In order to get a statistically stationary mean of the data, the audio signal is divided into frames of 25 ms. If the frame is too short, it may not be possible to have enough samples to get a reliable spectral estimate. If it is too long, the signal changes too much throughout the frame. Each frame can be converted into 12 MFCCs plus a normalized energy parameter. The first and second derivatives (Delta and Delta-Delta, respectively) of the MFCCs and energy can be calculated as extra features, resulting in 39 numbers representing each frame. However, the derivatives of the MFCC parameters are generally used when the original MFCCs do not provide the amount of information needed for a good classification.
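The framing step can be sketched as follows. This is only an illustration under stated assumptions: the text specifies the 25 ms frame length but not the hop between frames, so the `hop_ms` parameter (defaulting to non-overlapping frames) and the function name are hypothetical.

```python
def split_into_frames(samples, sample_rate, frame_ms=25.0, hop_ms=None):
    """Split a sample sequence into fixed-length frames.

    frame_ms follows the 25 ms frame length given in the text.
    hop_ms is an assumption: None means non-overlapping frames.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = frame_len if hop_ms is None else int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames
```

At a sampling rate of 16 kHz, a 25 ms frame holds 400 samples. With 13 values per frame (12 MFCCs plus energy) and their Delta and Delta-Delta counterparts, each frame is described by 13 x 3 = 39 numbers.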
The MFCC algorithm steps are shown in Figure 2.
SVM is a supervised machine learning classifier technique used primarily for large databases to categorize new samples. The algorithm searches for the optimal hyperplane, which separates different classes with the maximum margin between them. LibSVM [17], a widely accepted support vector library, is used to train and test the dataset. The data is separated into two parts: 90 % for training and 10 % for testing. On the training part, the validation sets for each fold are generated using 10-fold cross-validation. A Gaussian radial basis function kernel is used to classify the data, since it gives better approximations on the data. The best SVM parameters C and gamma (γ) are obtained using 10-fold cross-validation on the training dataset with validation data. Those parameters are determined using a mesh-grid search over the values suggested by [28].
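The data split and parameter search described above can be outlined with a library-independent sketch. It is an illustration, not the code used in this study: `evaluate` is a hypothetical callback that would train an RBF-kernel SVM on the training indices with the given C and gamma and return the validation accuracy, and the candidate values passed to `grid_search` are left to the caller because the grid from [28] is not reproduced here.

```python
import random
from itertools import product


def train_test_split_indices(n_samples, test_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction (10 % here) for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


def grid_search(train_indices, evaluate, c_values, gamma_values, k=10):
    """Return (mean accuracy, C, gamma) of the best grid point.

    evaluate(train, validation, c, gamma) is a hypothetical callback that
    trains a classifier and returns its accuracy on the validation fold.
    """
    best = None
    for c, gamma in product(c_values, gamma_values):
        scores = [evaluate(train, validation, c, gamma)
                  for train, validation in k_fold_indices(train_indices, k)]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, c, gamma)
    return best
```

The selected C and gamma would then be used to train on the full training part and to report accuracy on the held-out 10 %.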

E. Software Toolbox
The LibSVM [28] library for Matlab is used for SVM routines. Matlab's TreeBagger class is utilized for RF classification. The MFCC library of Wojcicki for Matlab [29] is used to calculate MFCCs.

F. Algorithm
In the main part, all datasets are first acquired to calculate the MFCCs for each individual file. Then, the first, second, and third features of each file are calculated using the MFCC values for each element. In more detail, each file is divided into frames of 25 ms. Then, MFCCs are calculated for each frame. After calculating the MFCCs, their average value and the corresponding first and second derivatives are computed. Then, a histogram of each MFCC is created by dividing its min-max range into 10 equally spaced bins. The counts of these histograms are divided by the total count to get the PDF of the MFCCs. Then, in order to leverage both the PDF value and the corresponding MFCC value, these two values are multiplied for each bin of each MFCC's PDF. Finally, all MFCC values, their average, and the first and second derivatives of each MFCC are stored for each frame. At the end of the file, the histogram and PDF are calculated using the MFCCs of every frame. The Covariance matrix and a label vector for the output emotion classes are generated by SVM. After the SVM analysis, Accuracy and Confusion Matrices are calculated as a mean value over all iterations.
In the SVM analysis part, training and test data are randomly selected. Then, 10-fold cross-validation is performed on the training data. Accuracy results of the SVM prediction are obtained by using the best parameters resulting from the cross-validation.
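The final averaging step, in which accuracy and confusion matrices are reported as means over repeated runs, can be sketched as below. The function names and the class ordering are illustrative assumptions; the labels in the usage note are toy values, not results from this study.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def mean_matrix(matrices):
    """Element-wise mean of equally sized confusion matrices."""
    n_runs = len(matrices)
    size = len(matrices[0])
    return [[sum(m[r][c] for m in matrices) / n_runs for c in range(size)]
            for r in range(size)]
```

Given one confusion matrix per run, `mean_matrix` yields the averaged matrix, and averaging `accuracy` over the same runs yields the mean accuracy.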

IV. MFCCS BASED FEATURE VECTORS
In this study, 12 MFCC coefficients plus the energy of each frame are calculated for each individual's audio file [29]. The details of the MFCC are explained in Section III-C1. For the feature extraction, three features are generated using MFCCs. These features are as follows.
Feature Set 1
Average of MFCCs, its Delta (first-order derivative) and Delta-Delta (second-order derivative): The average of the MFCCs is calculated over all frames of each speech file. Delta and Delta-Delta (first and second derivatives) are calculated by subtracting consecutive frames and consecutive Deltas, respectively.

Feature Set 2
Weighted MFCC values with respect to (wrt) their probability distribution: The PDF of each coefficient is calculated by building the histogram of each of the MFCCs over all frames. During this calculation, a different value interval is obtained for each MFCC from its min-max values. The second feature is calculated by multiplying the values in this interval by the corresponding PDF values:
$c_i \in [a_i, b_i], \quad i = 1, 2, \ldots, 13, \quad (1)$
where $a_i$ and $b_i$ are the min and max values of each of the MFCCs. The feature is the product
$c_i * \mathrm{pdf}(c_i). \quad (2)$
In that case, $c_i$ is an internal value within $[a_i, b_i]$. As shown in Fig. 3, $c_i$ is the discrete feature value and $\mathrm{pdf}(c_i)$ is the corresponding non-normalized probability value. In (2), the "*" operation is the element-wise multiplication. In this case, we encode the histogram just by multiplying these two values. So, only as many values as there are bins are used to represent the histogram.
Otherwise, bin values and the corresponding probability values must be used separately to describe the histogram. Thanks to this approach, the number of features is decreased, while the computational performance is increased, because the size of the histogram representation is halved.
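A worked sketch of this encoding is given below. It rests on several assumptions that the text leaves open: the bin centre is taken as the discrete value c_i, pdf(c_i) is taken as the bin count divided by the total count, and a coefficient that is constant over all frames is placed entirely in the first bin. The function names are illustrative.

```python
def weighted_histogram_feature(values, n_bins=10):
    """Encode one coefficient's histogram as (bin centre * bin probability).

    Assumptions: equally spaced bins over [min, max], bin centres as the
    discrete values c_i, and pdf(c_i) = bin count / total count.
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Degenerate case: every frame has the same value.
        return [lowest] + [0.0] * (n_bins - 1)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls on the right edge, so clamp it into the last bin.
        bin_index = min(int((v - lowest) / width), n_bins - 1)
        counts[bin_index] += 1
    total = len(values)
    centres = [lowest + (b + 0.5) * width for b in range(n_bins)]
    return [c * (n / total) for c, n in zip(centres, counts)]


def feature_set_2(mfcc_frames, n_bins=10):
    """Concatenate the weighted histogram of every coefficient of one file."""
    feature = []
    for coefficient_values in zip(*mfcc_frames):
        feature.extend(weighted_histogram_feature(list(coefficient_values), n_bins))
    return feature
```

With 13 coefficients and 10 bins, this yields 130 values per file, whereas storing bin values and probabilities separately would require 260.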

Feature Set 3
Concatenation of Feature Set 1 and Feature Set 2: In this feature set, Feature Set 1 and Feature Set 2 are assembled without any modification on both features.
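Feature Set 1 and the concatenation into Feature Set 3 can be sketched as follows. This reflects one reading of the description, namely that Delta and Delta-Delta are differences of consecutive frames and consecutive Deltas, each averaged over the file; the function names are illustrative and the Feature Set 2 vector is assumed to be computed separately.

```python
def consecutive_differences(rows):
    """Subtract each row from the next one, coefficient by coefficient."""
    return [[b - a for a, b in zip(previous, current)]
            for previous, current in zip(rows, rows[1:])]


def column_means(rows):
    """Average each coefficient over all rows."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def feature_set_1(mfcc_frames):
    """Mean MFCCs, mean Delta, and mean Delta-Delta of one file."""
    if len(mfcc_frames) < 3:
        raise ValueError("at least three frames are needed for Delta-Delta")
    delta = consecutive_differences(mfcc_frames)
    delta_delta = consecutive_differences(delta)
    return column_means(mfcc_frames) + column_means(delta) + column_means(delta_delta)


def feature_set_3(mfcc_frames, feature_set_2_vector):
    """Concatenate Feature Set 1 with a precomputed Feature Set 2 vector."""
    return feature_set_1(mfcc_frames) + list(feature_set_2_vector)
```

For frames holding 13 values each, Feature Set 1 has 39 entries, and Feature Set 3 appends the histogram-based values unchanged.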

V. MULTIPLE DATABASES
The details of the databases utilized in this study are as follows:
1. The Berlin Database: This is a database frequently used by emotion classification researchers, which contains speech data in the German language [23], [31]. Burkhardt et al. [26] give the details of the Berlin Database.
2. The PAU Database: We have collected English speech samples from YouTube video collections and videos of popular TV shows.
Figure 4 and Figure 5 illustrate histograms of the length of the audio files for the Berlin Database and our custom database, respectively. Bins of the histograms represent audio file length in seconds. The total number of files is 312 for the Berlin Database and 320 for the PAU database. The total time is 16 minutes for the Berlin Database and 10 minutes for the PAU database.

A. Database Features
In this study, the genders (male and female) of the associated individuals are noted as database metadata. Also, age categories are classified as "Young" (age between 12 and 30) and "Mature" (age between 31 and 60). Sadness, happiness, and neutrality are chosen as the target emotions to predict. Audio files are in wav format and their duration varies from 1 to 9 seconds. Acted and neutral speech types are also available.

B. Labelling
Labelling the audio files plays a significant role in the categorization of the data. In this study, all speech data are labelled with gender, emotion, and age data. Table II compares both databases according to their features.

C. PAU Database Generation
The PAU database is produced from the sources described in Table III by 4 (male) students, who are doing their PhDs in computer and electrical engineering departments. All speech data are inserted into the PAU database after independent control steps. In this control step, each member also checks the other members' data sets, which must be consistent with their corresponding labels. It took nearly three months to collect and process the data, which is approximately 102 MB in size (the database files will be provided free of charge to the academic and research community).

VI. EXPERIMENTAL RESULTS
The database consists of 632 audio samples in total. Experiments are conducted for the German Berlin database, the PAU English database, and a combination of both. For each case, training and test data are selected from their own datasets.
The number of audio files per database is shown in Table IV.

VII. CONCLUSIONS
Even though DNN performs better than SVM, in this study SVM is used as the classifier because of the lack of a sufficiently large amount of speech data.
Better results were obtained because the distributions of all MFCCs carry more information to represent the emotion than the average of the MFCCs alone. This novel feature provides a smaller data size for the histogram representation and requires less computational power. We can conclude that using this feature has two main advantages: feature representation size and computational cost.
The best results are achieved on the Berlin Database compared to the PAU (English) database because the sentences for the speech in the Berlin data are the same for each individual and they are performed in the same framework as well (in a studio environment). Procedural preferences during the speech, such as stressing words, mood, and mouth gesture, are almost the same.
As shown in Table V and Table VI, we have an approximately 8.5 % decrease of accuracy for the English database (Table VI) compared to the Berlin Database (Table V) because the sentences in every sample are quite different from one another for the former database. Furthermore, some additional noise resulting from the environment of the speech has a great impact on the audio files. All Berlin speech data are generated in an indoor studio environment, while our database contains speech utterances from different environments. Therefore, the procedures of data generation are quite different from our methodology. As a conclusion, we should note that our framework for audio generation is more appropriate for real-life conditions. Our study has better results than the average classification accuracy of SVM for speech emotion classification studies. The accuracy results obtained by SVM on the PAU database for the first, second, and third features are 70 %, 71 %, and 73 %, respectively. Those numbers are 75 %, 78 %, and 81 % for the Berlin Database. The results obtained are the average accuracy results of 60 runs. Those results support that the third feature helps us to obtain a better classification result.

Fig. 2. Block diagram of the MFCC Algorithm [1].

D. Classification
A speech emotion classification system consists of two stages: (1) feature extraction from the available (speech) data and (2) classification of the emotion in the speech utterance. In fact, the most recent research in speech emotion classification has focused on this step. A number of advanced machine learning algorithms have been developed for many different research areas. On the other hand, traditional classifiers have been used in almost all proposed speech emotion classification systems [23]. In this study, SVM is used to classify speech utterances by optimizing and training on the training data set and presenting performance results on the test sets.

TABLE I. CLASSIFICATION PERFORMANCE OF POPULAR CLASSIFIERS FOR THE SPEECH EMOTION CLASSIFICATION [23].

TABLE II. COMPARISON BETWEEN THE BERLIN AND PAU EMOTION CLASSIFICATION SPEECH DATABASES.

TABLE III. PAU EMOTION CLASSIFICATION SPEECH DATABASE DETAILS.

TABLE IV. DISTRIBUTION OF EMOTIONS OF DATABASES.

The results reported in Table V, Table VI, and Table VII are the average accuracy results of 60 runs. More specifically, all experiments are repeated 60 times. The peak (non-average) accuracy result obtained during the tests was 95 %. One of the models used in [13] by Yixiong et al. consists of the MFCC + MEDC + Energy triple. That model has the highest accuracy rate (91.3043 %) among all their models on the Berlin Database, but it is not clear whether that is a peak accuracy or a mean accuracy. In [26], Burkhardt et al. did not mention how the training and test data were separated. Their best neutral, happiness, and sadness recognition rates are 88.2 %, 83.7 %, and 80.7 %, respectively, while ours are 84.8 %, 85.29 %, and 88.5 % for the third feature on the Berlin Database (in German). The results reveal that our features result in better performance for identifying the emotions of happiness and sadness.

TABLE V. EXPERIMENTAL RESULTS FOR BERLIN DATABASE.

TABLE VI. EXPERIMENTAL RESULTS FOR PAU DATABASE.

TABLE VII. EXPERIMENTAL RESULTS FOR COMBINED DATABASE.