Towards Speaker Identification System based on Dynamic Neural Network

Conventional, Finite Impulse Response and Lattice-Ladder multilayer perceptron (MLP) structures with 4, 8 and 16 hidden neurons were evaluated for speaker identification. The experiments were performed on a database of 10 speakers, 3 Lithuanian words and 7 sessions. Identification performance was compared against two baseline methods: Vector Quantization (Linde-Buzo-Gray) and Gaussian Mixture Models (Expectation Maximization). Increasing the number of neurons in the hidden layer led to smaller mean square errors on the training dataset. The Finite Impulse Response MLP showed smaller mean square error values. The results of the experimental investigation show that neural networks can be used in a speaker identification system, as they outperform the baseline methods. The best identification rate was achieved by a multilayer perceptron with 4 hidden neurons and a Finite Impulse Response MLP with 8 hidden neurons. DOI: http://dx.doi.org/10.5755/j01.eee.18.10.3066


I. INTRODUCTION
Human-machine interaction becomes more popular every day, and speech is probably the most acceptable method of interaction for human beings. Numerous reasons stimulate research in this area, from increased efficiency through saving human resources in customer support to user-friendliness in controlling domestic appliances, cars, mobile phones and computers. As some mobile gadgets may not have enough processing power, custom-designed hardware or the use of an FPGA could be a solution [1].
Human speech carries not only a plain text message but also a lot of distinct information, such as the language being spoken, the physical and emotional state, the accent and the identity of the speaker. A human being always tries to consider all this information before taking action, which is not an easy task for a machine [2]. As has already been demonstrated, Lithuanian has specific phonetic, syntactic and lexical properties along with specific accentuation. Therefore, a machine we want to control should also take into account the other information contained in speech. The most important task after speech recognition is to determine the identity of the speaker, as this would allow the device to ignore commands given by people who should not have the authority to work with it. This could prove useful in systems described in [3].
Therefore, speaker identification also receives a lot of research effort. However, neither an optimal set of features nor the most suitable classifier has been agreed on. Experiments usually show different results depending on the speaker database used [4], which can be influenced by the recording hardware, surroundings, language, speaker origin and so on.
In this short paper we compare two implementations of classical baseline methods, VQ-LBG [5] and GMM-EM [6], against three artificial neural networks, MLP [7], FIR MLP [8] and Lattice-Ladder MLP [8], presenting experimental results of their use for speaker identification. Our aim is to incorporate national speech aspects in the system, thus Lithuanian words uttered by native speakers were used for the experiments.

A. Signals and features used
The records of 10 speakers in 7 sessions pronouncing 3 Lithuanian words, "turėti", "nebūti" and "mokykla", are used in the experiments. These words were chosen as they include all vowels of the Lithuanian language. The accentuation of these words is also specific: the first two have the second syllable stressed, while the third has its last syllable stressed. The selected words do not include diphones. There exist 1542 diphones [9] and it would be very difficult to cover them all.
The sampling frequency of the records is 11 kHz; the records are saved in 16-bit PCM format as WAV files. The recordings were made in a silent environment using personal computers. Stationary noise, including mains interference and other noise produced by non-professional equipment, was removed by a Wiener filter; for this purpose, 10 s of pure silence containing the stationary noise was recorded before the words were pronounced.
The Mel-frequency cepstrum coefficient (MFCC) feature space was selected for transformation of the signals. The amplitudes of the pronounced words were normalized so that the maximal values of the signals were equal. The signals were framed taking 256 samples per frame (23.22 ms) with an overlap of 100 samples (11.03 ms). The frame energy and zero-cross-ratio were calculated. If the threshold values were exceeded neither by the frame nor by any of its 10 neighbours on each side, the frame was dropped. 20 MFCC were extracted from each of the remaining frames. The first coefficient was discarded, resulting in 19 MFCC used to form a feature vector.
An MFCC vector contains only information from within one frame and does not give any information on changes from previous frames. In order to take dynamic information into account, special classifiers such as HMM-based ones are required. The other common way is to form delta or even delta-delta coefficients by subtracting the MFCC vector values of the previous frame from the values of the current frame.
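The framing and frame-dropping steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the threshold values are placeholders, and the rule that a frame counts as active when either its energy or its zero-cross-ratio exceeds a threshold is an assumption, since the text does not say how the two measures are combined.

```python
def frame_signal(samples, frame_len=256, overlap=100):
    """Split a sample sequence into overlapping frames (trailing partial frame dropped)."""
    hop = frame_len - overlap
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]

def frame_energy(frame):
    """Sum of squared sample values of one frame."""
    return sum(x * x for x in frame)

def zero_cross_ratio(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def active_frame_mask(frames, energy_thr, zcr_thr, neighbours=10):
    """Keep a frame if it, or any of its `neighbours` frames on each side, is active.

    Assumption: a frame is "active" when energy OR zero-cross-ratio exceeds its threshold.
    """
    active = [frame_energy(f) > energy_thr or zero_cross_ratio(f) > zcr_thr
              for f in frames]
    n = len(active)
    return [any(active[max(0, i - neighbours):min(n, i + neighbours + 1)])
            for i in range(n)]
```

With these defaults the hop between frame starts is 256 − 100 = 156 samples; MFCC extraction would then be applied only to the frames whose mask entry is true.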
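The simple frame-difference form of delta coefficients mentioned above can be sketched as below. The function name is hypothetical, and dropping the first frame (which has no predecessor) is an assumption; other schemes pad it or use a regression window instead.

```python
def delta_features(mfcc_frames):
    """Delta coefficients: current frame minus previous frame, element-wise.

    Assumption: the first frame has no predecessor, so the result has
    one frame fewer than the input.
    """
    return [[cur - prev for cur, prev in zip(cur_frame, prev_frame)]
            for prev_frame, cur_frame in zip(mfcc_frames, mfcc_frames[1:])]

# Delta-delta coefficients follow by applying the same difference again.
mfcc = [[1.0, 2.0], [3.0, 5.0], [6.0, 9.0]]   # toy 2-coefficient frames
delta = delta_features(mfcc)
delta_delta = delta_features(delta)
```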

B. Types of artificial neural networks selected
Artificial neural networks are investigated in order to see how well various structures can evaluate the dynamics of MFCC vectors and perform the speaker identification task.
The FIRMLP structure is considered to be more powerful than the MLP due to its capability to process time-dependent signals. The LLMLP structure stands out because it is easy to track the stability of the filter during the training procedure: the stability of a lattice-ladder filter is guaranteed if the absolute value of every lattice coefficient does not exceed 1.
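The stability condition on the lattice coefficients is cheap to test after every training update. A minimal sketch follows; the function name is hypothetical, and the strict inequality is a conservative assumption that treats a coefficient of magnitude exactly 1 as unstable.

```python
def lattice_is_stable(lattice_coeffs):
    """True if every lattice (reflection) coefficient has magnitude below 1.

    Assumption: the boundary |k| == 1 is rejected to stay on the safe side.
    """
    return all(abs(k) < 1.0 for k in lattice_coeffs)
```

A check of this kind needs only one comparison per coefficient, which is what makes stability tracking inexpensive for this structure.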

C. Method used for dataset construction
The available data are separated into 3 groups: 3 sessions are used for training, 1 for validation and 3 for testing. MFCC feature vectors from the 3 words of one session are combined into one set of feature vectors for each speaker. Further, the data of different speakers are combined into one set for each session. Afterwards, the resultant vectors from sessions 1, 2 and 3 are combined into the training dataset, those from session 4 into the validation dataset and those from sessions 5, 6 and 7 into the testing dataset.
A separate network is constructed for each of the speakers, so 10 networks must be trained separately. A different desired output signal must be formed for each of the 10 networks. The desired output is set to "1" for the MFCC feature vectors of the speaker we want to identify, whereas feature vectors of other speakers are marked by "−1" in the desired output signal. A problem arises in such a construction, because the ratio of feature vectors belonging to the speaker to be identified to those of the other speakers is 1 : 9. This biases the pattern classification, and unseen feature vectors of the true speaker are classified incorrectly more often. The problem is solved by extracting additional feature vectors from the same signals, each shifted by ten samples. The data composition of one session is graphically depicted in Fig. 2. As a result, different training, validation and testing datasets are composed for each of the 10 networks.
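The target construction and the shift-based balancing can be sketched as below. Both function names are hypothetical, and the number of shifted copies (9, chosen to offset the 1 : 9 ratio) is an assumption; the text only states that the shift is ten samples.

```python
def desired_outputs(speaker_ids, target_speaker):
    """+1 for feature vectors of the target speaker, -1 for all others."""
    return [1 if s == target_speaker else -1 for s in speaker_ids]

def shifted_copies(signal, n_copies=9, shift=10):
    """Copies of one signal starting 0, 10, 20, ... samples later.

    Assumption: 9 copies are used so that the target speaker contributes
    about as many feature vectors as the 9 other speakers together.
    """
    return [signal[k * shift:] for k in range(n_copies)]

labels = desired_outputs([1, 2, 3, 1], target_speaker=1)
copies = shifted_copies(list(range(100)))
```

Each shifted copy is framed from a different starting sample, so its frames, and therefore its MFCC vectors, differ slightly from those of the unshifted signal.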

D. Method used for neural network training
The training of the neural networks is performed by changing the training parameter µ using the Levenberg-Marquardt training algorithm

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \left(\mathbf{J}^{T}\mathbf{J} + \mu\mathbf{I}\right)^{-1}\mathbf{J}^{T}\mathbf{e},$$

where $\mathbf{w}_k$ is the weights' matrix at the k-th instance; $\mathbf{J}$ and $\mathbf{I}$ are the Jacobian and identity matrices; $\mathbf{e}$ is the error vector. The optimization criterion is the mean square error (MSE):

$$\mathrm{MSE} = \frac{1}{F}\sum_{i=1}^{F}\left(d(i) - o(i)\right)^{2},$$

with $d(i)$ and $o(i)$ as the desired and actual network outputs and $F$ as the total number of feature vectors. The standard algorithm requires a change, because it tends to overshoot and to make the filter unstable while µ is small (when it performs similarly to the Newton algorithm). The stability of the filters is tested and µ is multiplied by µ_inc in case of instability. As µ increases, the Levenberg-Marquardt algorithm behaves more like steepest descent.
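The update rule and the stability-driven adaptation of µ can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the function names and the `is_stable` hook are hypothetical, the residual is taken as output minus desired value with J as its Jacobian, rejecting steps that do not reduce the MSE and multiplying µ by µ_dec after an accepted step are standard Levenberg-Marquardt conventions assumed here, and the default values follow the settings reported in this section.

```python
import numpy as np

def lm_step(w, e, J, mu):
    """One update: w_next = w - (J^T J + mu*I)^-1 J^T e."""
    H = J.T @ J + mu * np.eye(w.size)
    return w - np.linalg.solve(H, J.T @ e)

def train_lm(w, residual_fn, jacobian_fn, is_stable=lambda w: True,
             mu=1e-3, mu_inc=5.0, mu_dec=0.15, max_iter=25):
    """Run up to max_iter Levenberg-Marquardt iterations; return (weights, mu)."""
    w = np.asarray(w, dtype=float)
    for _ in range(max_iter):
        e = residual_fn(w)
        candidate = lm_step(w, e, jacobian_fn(w), mu)
        improved = np.mean(residual_fn(candidate) ** 2) < np.mean(e ** 2)
        if is_stable(candidate) and improved:
            w, mu = candidate, mu * mu_dec   # accept: move towards Newton-like steps
        else:
            mu *= mu_inc                     # reject: move towards steepest descent
    return w, mu

# Toy usage: fit y = 2x + 1 with weights (slope, intercept).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
residual = lambda w: w[0] * x + w[1] - y                       # output minus desired
jacobian = lambda w: np.column_stack([x, np.ones_like(x)])     # d(residual)/dw
w_fit, mu_end = train_lm([0.0, 0.0], residual, jacobian)
```

For a lattice-ladder network the `is_stable` hook would test the lattice coefficients of the candidate weights, so that an unstable candidate is discarded and µ grows.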
Sigmoid activation functions are used in the hidden and output layer neurons. The ladder coefficients are initialized randomly, whereas the lattice coefficients and biases are set to 0; µ = 0.001, µ_inc = 5, µ_dec = 0.15. Training is stopped if any of the following criteria is met: the number of iterations exceeds 25, the MSE reaches $10^{-6}$, the gradient value drops below $10^{-4}$, or µ becomes greater than $10^{16}$. The values of the filter coefficients are saved after each iteration and the ones with the smallest MSE value on the validation dataset are taken.
In order to be consistent in the comparison, first-order filters were chosen for the FIRMLP and LLMLP.
Finding the global minimum of the MSE is not guaranteed, and the outcome depends on the initial values of the trained network coefficients. Thus the experiments were repeated 10 times for each of the 10 networks (one per speaker), and only the best solution with the smallest validation error was chosen. Analysis of 3 different types and 3 different structures led to 900 experiments in total.
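The stopping rule and the selection of the saved coefficients can be written compactly. The sketch below uses hypothetical function names and assumes that a snapshot is stored as a (validation MSE, coefficients) pair after every iteration.

```python
def should_stop(iteration, mse, grad_norm, mu,
                max_iter=25, mse_goal=1e-6, min_grad=1e-4, mu_max=1e16):
    """True as soon as any one of the four stopping criteria is met."""
    return (iteration > max_iter
            or mse <= mse_goal
            or grad_norm < min_grad
            or mu > mu_max)

def pick_best(snapshots):
    """Return the coefficients of the snapshot with the smallest validation MSE.

    Assumption: snapshots is a list of (validation_mse, coefficients) pairs.
    """
    return min(snapshots, key=lambda s: s[0])[1]
```

Selecting by validation MSE rather than by the last iteration means the returned coefficients may come from an earlier point of training.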
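The total follows from multiplying the factors named in this section; the worked count below assumes that every combination was trained exactly once per repetition.

```latex
\underbrace{3}_{\text{network types}} \times
\underbrace{3}_{\text{hidden layer sizes}} \times
\underbrace{10}_{\text{speaker networks}} \times
\underbrace{10}_{\text{repetitions}} = 900
```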

1) VQ method with the Linde-Buzo-Gray training algorithm (VQ-LBG), calculating the sum of minimum Euclidean distances from 16 centroids;
2) GMM trained by the EM algorithm (EM-GMM) using 16 mixtures.
Unfortunately, the smallest validation error does not guarantee the best network performance on the testing dataset, as can be seen in Fig. 3a, where the minimum validation error is at iteration 5 and the minimum testing error at iteration 9. This was seen mostly during training of the networks for identification of Speaker 1. Other networks showed smaller shape differences between the validation MSE and testing MSE curves (Fig. 3b). This could be explained by bigger differences between the feature vectors of the speech sessions of Speaker 1. In this case, taking the coefficients at iteration 9 instead of 5 would reduce the MSE of the testing set by more than 34 %. However, the testing set is supposed to be data unseen by the network during the learning procedure, so a possible solution would be to collect more data from Speaker 1 for validation.
A closer look at the mean square errors for the training datasets given in Table I reveals that increasing the number of hidden neurons decreases the errors in all three types of networks (values on grey background), with the exception of the LLMLP networks for Speaker 6 identification (value underlined). Comparing the performance of the network types with the same number of hidden neurons shows marginal advantages of the FIRMLP networks, with an exception for Speaker 8 identification marked by a wavy underline. The demonstrated performance of the LLMLP structure networks seems to suffer from the curse of dimensionality, which could probably be solved by initializing the ladder and bias coefficients from the FIRMLP structure and letting the learning process adapt the lattice coefficients initialized as zeros.

B. Generalized results
Unlike the MSE values on the training dataset, the values produced by the testing dataset seem to have no trends. Speaker identification results are given in Table II. The given values show how many mistakes the networks made when identifying the speakers in a closed-set test. The signals from sessions 5, 6 and 7 were used. This results in 3 words for each of the ten speakers (30 per session). Speaker identification was performed by feeding each word into the networks, and the decision from the outputs of the networks was made in two ways, where $o_p(i)$ is the i-th output value of the p-th artificial neural network and $F$ is the total number of frames in the utterance.

Fig. 1. A universal representation of the artificial neuron used in the hidden layer for structures of: a) MLP, b) FIRMLP, c) LLMLP.

Fig. 3. Mean square errors of the first-order FIRMLPs for different speaker identification: a) Speaker 1, b) Speaker 10.
Fig. 2. A dataset formed from one session's data for each of the 10 speakers. Each block consists of MFCC feature vectors extracted from 3 words. White blocks depict feature vectors extracted from the same shifted signal produced by the speaker to be identified; grey blocks depict feature vectors of other speakers.

TABLE I. MEAN SQUARE ERROR VALUES OF VARIOUS NEURAL NETWORK STRUCTURES ON TRAINING AND TESTING DATASET.

TABLE II. SPEAKER IDENTIFICATION ERRORS.