A Novel Fuzzy Optimized CNN-RNN Method for Facial Expression Recognition

1 Abstract —Facial


I. INTRODUCTION
Facial expression is one of the most important ways to convey emotion in interpersonal communication. It mainly refers to the configuration of the facial muscles and facial features. Common expressions include anger, disgust, fear, joy, and surprise [1]. People can read each other's emotions directly from facial expressions. In addition, automatic facial expression analysis has been widely applied in social robots [2]-[5], medical treatment [6], driver fatigue monitoring [7]-[9], and many other human-computer interaction systems [10]-[13], so facial expression recognition has important research value.
In general, facial expression recognition methods are divided into three steps: face detection, feature extraction, and classification, of which feature extraction is the key to accurate recognition [14]. The existing feature extraction methods mainly include traditional methods and deep learning methods [15]. Traditional methods for facial expression recognition usually use manual features or shallow learning, e.g., local binary patterns (LBP) [16] and local binary patterns from three orthogonal planes (LBP-TOP) [17]. However, traditional machine learning methods struggle to meet current needs because of their long data processing time, lack of intelligence, and weak performance [18]. Since 2013, emotion recognition competitions such as FER2013 and EmotiW have collected relatively sufficient training data from challenging realistic scenes, which has effectively promoted the transition of facial expression recognition from laboratory-controlled settings to real environments [19].
Manuscript received 26 April, 2021; accepted 22 July, 2021.
With the rapid growth of graphics processing unit (GPU) computing power, it has become relatively easy to train large deep learning networks [20]. Therefore, deep learning is increasingly applied to facial expression recognition in the real world [21]. For facial expression recognition, deep learning algorithms do not rely on complex image preprocessing and do not require carefully designed manual features [22]. Compared with traditional machine learning algorithms, deep learning algorithms perform better and are more robust to illumination changes, varied poses, and occlusion. Hu et al. [23] used a convolutional network to perform facial expression recognition and facial key point detection through multi-task learning. In the multi-task training, the bottom layers of the convolutional neural network (CNN) are shared, and the layers to be shared are determined by automatic learning, so the model is not affected by the distribution of the data sets. To avoid the complex feature extraction and data manipulation involved in traditional facial expression recognition, Jiang and Learned-Miller [24] recognized facial expressions based on the Faster recurrent neural network (RNN) model from the field of target detection. First, the facial expression images are normalized, and then the convolutional network of the model extracts the features and reduces the dimension of the feature maps. The candidate regions are generated by the region candidate network and detected by FERT-RNN. Finally, the expression classification and the bounding-box coordinates are given by the Softmax classifier and regression.
However, existing methods still have the following limitations:
1. Most networks convolve directly on the expression image without enhancing it, which reduces the detection accuracy of the network when the image is distorted;
2. Most current networks simply stack convolutions, ignoring the characteristics of the feature maps and the information lost as convolution proceeds layer by layer.
Recent literature shows that CNN can be combined with other algorithms to achieve better recognition performance [29]. However, very limited work has been done to address this issue. To this end, this study proposes a novel CNN-based combination method for facial expression recognition. An image preprocessing module is added to the CNN structure, in which bilinear interpolation preserves the detail characteristics of the image; in addition, affine transformation enlarges the dataset to enhance the generalization ability of the RNN structure. Finally, the Fuzzy optimized CNN-RNN structure is trained on public databases and used to recognize facial expressions, obtaining better recognition results. The contributions of this work are summarized as follows.
• For the first time, a novel CNN combination model is proposed for facial expression recognition, which can improve the convolution process in the CNN operation;
• For the first time, fuzzy logic is employed to optimize the CNN-RNN model, which can improve the efficiency and effectiveness of facial expression recognition.

A. Overview of the Fuzzy-CNN-RNN Method
To address the difficulty that the original CNN has in making effective use of global information during classification, a new network structure is constructed in this paper. The network structure includes four parts: a) the preprocessing part, which scales the image and applies the affine transformation; b) the CNN feature extraction part, in which features are extracted from the pre-processed facial expression images by convolution; c) the Fuzzy-RNN module, which is added to exploit global information and improve the detection accuracy; d) the fully connected layer, through which the prediction is made. The network framework is shown in Fig. 1. In Fig. 1, the original image is first scaled by the bilinear interpolation method. Image scaling can not only reduce the image size to shorten the training time, but also expand the facial expression data set by scaling the image by different factors. The affine transformation is then applied to the scaled image to further expand the data set. Finally, the image is input to the improved convolutional neural network for classification. The main modules of the improved convolutional neural network are shown in Table I. The network consists of five modules. The first module, C1, is a convolution layer: a 5*5 convolution produces first-layer convolution features of size 32*44*44. The second layer, S2, reduces the feature data through max pooling. The third layer is a 3*3 convolution, and the size of its feature map is 64*20*20. The feature map is reduced again by the max pooling layer of the fourth layer, and the classification is finally achieved through the Fuzzy-RNN.
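The feature-map sizes quoted above can be checked with a short shape calculation. The sketch below assumes a 48×48 input (the size given in the training setup), stride-1 convolutions without padding, and 2×2 max pooling with stride 2. None of these settings is stated explicitly for Table I, but they are the ones under which 5*5 and 3*3 convolutions yield the reported 44*44 and 20*20 maps. The layer names C3 and S4 and the helper functions are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution (assumed: stride 1, no padding)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Spatial size after max pooling (assumed: 2x2 window, stride 2)."""
    return (size - window) // stride + 1


def trace_shapes(input_size=48):
    """Return (layer, channels, height, width) for the four feature layers."""
    shapes = []
    s = conv_out(input_size, kernel=5)   # C1: 5*5 convolution, 32 channels
    shapes.append(("C1", 32, s, s))
    s = pool_out(s)                      # S2: max pooling
    shapes.append(("S2", 32, s, s))
    s = conv_out(s, kernel=3)            # C3 (hypothetical name): 3*3 convolution, 64 channels
    shapes.append(("C3", 64, s, s))
    s = pool_out(s)                      # S4 (hypothetical name): max pooling
    shapes.append(("S4", 64, s, s))
    return shapes


for name, channels, height, width in trace_shapes():
    print(f"{name}: {channels}*{height}*{width}")
```

Under these assumptions the fourth-layer pooling would pass a 64*10*10 feature map to the Fuzzy-RNN module.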

B. Dataset Preprocessing
In computer-vision-based facial expression recognition, it is necessary to preprocess the dataset. The preprocessing in this paper mainly includes the resizing of facial expression images and the expansion of the facial expression image set.
1. Image resizing
Image resizing adjusts the size of the image to fit the input of the network settings. It is a very common preprocessing method in the field of deep learning. Different from the linear interpolation in [30] and [31], in this work, the bilinear interpolation method is used to adjust the size of the image, which keeps the details of the image features, ensures the integrity of the face image, and avoids a reduction in identification accuracy.
The principle of bilinear interpolation is shown in Fig. 2. The grayscale values of the four pixels adjacent to (X, Y) are used to calculate the grayscale value at (X, Y). Let the four adjacent pixels be A, B, C, and D, with coordinates (i, j), (i + 1, j), (i, j + 1), and (i + 1, j + 1) and grayscale values g(A), g(B), g(C), and g(D), respectively. First, the grayscale values g(E) and g(F) at points E and F are calculated; the gray value at (X, Y) is then calculated from g(E) and g(F). The above bilinear interpolation method is used to zoom the facial expression image. The zoomed expression image is shown in Fig. 3.
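As an illustration, the two-step computation can be sketched in plain Python. The sketch assumes that E and F lie at horizontal position X on the rows through A, B and through C, D respectively, which is the usual construction but cannot be confirmed without Fig. 2. The function names and the corner-aligned coordinate mapping in `resize_bilinear` are likewise illustrative choices, not the authors' implementation.

```python
def bilinear_sample(img, x, y):
    """Gray value at real-valued (x, y); img[j][i] is the pixel in column i, row j.

    Assumes the image is at least 2x2 and (x, y) lies inside it.
    """
    height, width = len(img), len(img[0])
    i = min(int(x), width - 2)    # column of A and C, clamped at the right border
    j = min(int(y), height - 2)   # row of A and B, clamped at the bottom border
    dx, dy = x - i, y - j
    g_a, g_b = img[j][i], img[j][i + 1]          # A = (i, j),     B = (i + 1, j)
    g_c, g_d = img[j + 1][i], img[j + 1][i + 1]  # C = (i, j + 1), D = (i + 1, j + 1)
    g_e = g_a + dx * (g_b - g_a)   # assumed E: position X on the row through A and B
    g_f = g_c + dx * (g_d - g_c)   # assumed F: position X on the row through C and D
    return g_e + dy * (g_f - g_e)  # interpolate between E and F along y


def resize_bilinear(img, new_width, new_height):
    """Resize a grayscale image (list of rows); output size must be at least 2x2."""
    height, width = len(img), len(img[0])
    sx = (width - 1) / (new_width - 1)    # corner-aligned mapping (an assumption)
    sy = (height - 1) / (new_height - 1)
    return [[bilinear_sample(img, u * sx, v * sy) for u in range(new_width)]
            for v in range(new_height)]


tiny = [[0, 10],
        [20, 30]]
for row in resize_bilinear(tiny, 3, 3):
    print(row)
```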

2. Image expansion
A large amount of training data helps ensure the accuracy of network identification [32]. However, many studies use the images directly for network training after scaling, which limits the generalization ability of the network to a certain extent and results in low detection accuracy when the face data are distorted. To solve these problems, an affine transformation of the image is carried out during preprocessing. It not only increases the amount of training data, but also improves the generalization ability of the network by training on data seen from different angles.
The affine transformation [33], [34] is a linear transformation of two-dimensional coordinates that preserves the straightness and parallelism of two-dimensional figures. In other words, an affine transformation allows the image to be tilted arbitrarily and stretched or contracted arbitrarily in two directions. It covers translation, rotation, scaling, and reflection, and can be expressed by (4). In the formula, $(t_x, t_y)$ represents the translation amount, and the parameters $a_i$ reflect the amount of image rotation and scaling. $(x', y')$ represents the coordinates after transformation, and $(x, y)$ the coordinates before transformation. After affine transformation, the facial expression image is shown in Fig. 4, where $a_1$, $a_2$, $a_3$, and $a_4$ are 1, 0, 0, and 1.5, respectively, and the translation is (0, 0).
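A minimal sketch of the transformation follows, assuming the common parameterization x' = a1·x + a2·y + t_x and y' = a3·x + a4·y + t_y. The exact layout of the coefficients in (4) is an assumption, although it makes no difference for the Fig. 4 setting, where a2 = a3 = 0. The helper names and the 10-degree angle are hypothetical.

```python
import math


def affine_point(x, y, a1, a2, a3, a4, tx=0.0, ty=0.0):
    """Map (x, y) to (x', y') with x' = a1*x + a2*y + tx and y' = a3*x + a4*y + ty."""
    return a1 * x + a2 * y + tx, a3 * x + a4 * y + ty


def rotation_coeffs(degrees):
    """Coefficients (a1, a2, a3, a4) of a rotation about the origin.

    The rotation is counter-clockwise in a y-up coordinate system.
    """
    t = math.radians(degrees)
    return math.cos(t), -math.sin(t), math.sin(t), math.cos(t)


# Corner coordinates of a 48x48 image.
corners = [(0, 0), (47, 0), (0, 47), (47, 47)]

# Fig. 4 setting: a1..a4 = 1, 0, 0, 1.5 with zero translation stretches the y axis by 1.5.
print([affine_point(x, y, 1, 0, 0, 1.5) for x, y in corners])

# A rotated copy for data expansion (the angle is an arbitrary example).
a1, a2, a3, a4 = rotation_coeffs(10)
print([affine_point(x, y, a1, a2, a3, a4) for x, y in corners])
```

To warp a whole image, the same mapping would be applied to every pixel coordinate, typically in the inverse direction combined with interpolation.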

C. Loop Structure
At present, deep neural network models are commonly used to improve the detection accuracy of neural networks [35], but simply deepening a convolutional neural network does not solve the low utilization of feature maps or the problem of vanishing features. In this paper, the traditional CNN model is optimized by combining CNN with Fuzzy-RNN for facial expression recognition.
First, the RNN has a powerful ability to capture contextual information in sequences [36]. Using contextual clues for image-based sequence recognition is more effective than processing each symbol independently. Second, the RNN can backpropagate errors to the convolutional layer so that the network can be trained end-to-end [37], [38]. By introducing the RNN module into the CNN model, the convolution features can be recycled [39]. On this basis, not only can the redundant features extracted by the network be reduced, but the disappearance of features can also be avoided by making effective use of the convolution features. Third, fuzzy logic is well suited to managing the uncertainty in the feature map [40]; hence, the Fuzzy-RNN is able to significantly improve recognition performance on facial expressions.
The RNN was first used in natural language processing, where an entire sentence is treated as sequential data and the understanding of each word builds on that of the previous words [41], [42]. When an artificial neural network performs natural language processing, it needs a structure that infers the next word from the context of the sentence by feeding the previous output back as an input for the inference. RNNs are a family of neural networks for processing sequential data. The RNN structure is shown in Fig. 5. In Fig. 5, $O_t$ is the output unit, $X_t$ is the given input convolution feature sequence, and $h_t$ is the hidden layer unit. A one-way flow of information from the input unit reaches the hidden unit, and another one-way flow of information from the hidden unit reaches the output unit. $h_t$ is calculated from the current input and the state of the previous hidden layer $h_{t-1}$, as shown in (5) [37], [43], where $f$ represents a nonlinear activation function, such as tanh or ReLU, and $U$ and $W$ are shared parameters. $O_t$ is the output of step $t$, which depends on the activation of the current neuron, as shown in (6), where $\sigma$ represents the activation function of the output layer. Due to the stochastic nature of the CNN deep learning process, uncertainty often exists in the obtained feature map. To mitigate its effect, fuzzy logic is used to map the highly nonlinear relationship between different features. By doing so, the uncertainty of the features is reduced, allowing the Fuzzy-RNN to start the circular convolution from the input sequence feature map, so that the facial expression feature map is used more efficiently and the global information in the context of the feature map is exploited. Details of the Fuzzy-RNN theory can be found in [44]-[46].
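The recurrence can be illustrated with a small forward pass in plain Python. It assumes the standard form h_t = f(U·X_t + W·h_{t-1}) and O_t = σ(V·h_t), with tanh for f and softmax for σ. The output matrix V is not named in the text and is introduced here only for illustration, the weight values are arbitrary, and the fuzzy component of the Fuzzy-RNN is not modelled.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(z):
    """Numerically stable softmax, used here as the output activation."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(x_t, h_prev, U, W, V):
    """One step: h_t = tanh(U x_t + W h_prev), o_t = softmax(V h_t).

    U and W are the shared parameters named in the text; V is a hypothetical
    output matrix added for this sketch.
    """
    pre = [u + w for u, w in zip(matvec(U, x_t), matvec(W, h_prev))]
    h_t = [math.tanh(p) for p in pre]
    o_t = softmax(matvec(V, h_t))
    return h_t, o_t


def rnn_forward(xs, h0, U, W, V):
    """Run the recurrence over a sequence of feature vectors with shared weights."""
    h, outputs = h0, []
    for x_t in xs:
        h, o_t = rnn_step(x_t, h, U, W, V)
        outputs.append(o_t)
    return h, outputs


# Toy example: two 3-dimensional feature vectors, 2 hidden units, 2 classes.
U = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
W = [[0.2, 0.0], [-0.3, 0.4]]
V = [[1.0, -1.0], [0.5, 0.5]]
final_h, outs = rnn_forward([[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]], [0.0, 0.0], U, W, V)
print(final_h, outs)
```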

A. Training Method
The experiments were carried out on a deep learning server with a Tesla P100 GPU, running Linux and Python 2.7. First, the images of the data sets were resized to 48×48 pixels, and the batch size was set to 128 according to the GPU memory available during training. The numbers of training epochs for the three data sets were 60, 60, and 250, respectively. Gradient descent with momentum was adopted as the optimization algorithm. The initial learning rate is 0.01, the momentum is 0.9, and the weight decay is $5 \times 10^{-4}$. Weight decay helps prevent overfitting and improves the generalization performance of the model.
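One update of momentum gradient descent with the stated hyperparameters might look as follows. The sketch assumes that weight decay is applied as an L2 term added to the gradient and that the velocity accumulates this gradient before the learning rate is applied, a convention used by several common frameworks. The paper does not specify which variant was used, and the function name is hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.01, momentum=0.9, weight_decay=5e-4):
    """Return updated (weights, velocity) after one momentum SGD step.

    Assumed update rule:
        g = grad + weight_decay * w
        v = momentum * v + g
        w = w - lr * v
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + weight_decay * w   # L2 weight decay folded into the gradient
        v_new = momentum * v + g_total   # accumulate velocity
        new_velocity.append(v_new)
        new_weights.append(w - lr * v_new)
    return new_weights, new_velocity


# Two consecutive steps on a single weight with a constant gradient.
w, v = [1.0], [0.0]
for _ in range(2):
    w, v = sgd_momentum_step(w, [0.5], v)
    print(w, v)
```

With a constant gradient the velocity grows from step to step, which is the accelerating effect of the momentum term.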
During training, after each epoch is completed, the model is evaluated on the test set and saved. After all training rounds, the weights with the best recognition performance are saved into the final model file. The training results of the proposed method are shown in Fig. 6. As can be seen, the training accuracy gradually increases and is close to 100 % after 3,000 epochs, while the entropy loss gradually decreases and is close to 0 after 3,000 epochs. There are no sudden fluctuations in the accuracy and entropy loss curves during training, indicating that the training of the proposed method is stable and reliable.
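The evaluate-after-each-epoch procedure, with the best weights kept at the end, can be sketched as a small loop. `train_one_epoch` and `evaluate` are hypothetical callables standing in for the actual training and evaluation code, which the paper does not give; the accuracy values in the toy example are made up purely to exercise the loop.

```python
def train_with_best_checkpoint(num_epochs, train_one_epoch, evaluate):
    """Train for num_epochs and keep the weights with the best evaluation accuracy.

    train_one_epoch(epoch) -> weights after that epoch (hypothetical callable)
    evaluate(weights)      -> accuracy in [0, 1]       (hypothetical callable)
    """
    best_acc, best_epoch, best_weights = -1.0, None, None
    for epoch in range(1, num_epochs + 1):
        weights = train_one_epoch(epoch)
        acc = evaluate(weights)
        if acc > best_acc:   # keep only the best-performing weights so far
            best_acc, best_epoch, best_weights = acc, epoch, weights
    return best_epoch, best_acc, best_weights


# Toy stand-ins: the "weights" just record the epoch, and the accuracies are invented.
toy_accuracies = [0.50, 0.80, 0.70]
result = train_with_best_checkpoint(
    num_epochs=3,
    train_one_epoch=lambda epoch: {"epoch": epoch},
    evaluate=lambda weights: toy_accuracies[weights["epoch"] - 1],
)
print(result)
```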

B. Evaluation Index
Cross-entropy loss function: In classification tasks, the cross-entropy error is often used as the loss function [46]-[49]. The cross-entropy error is given by

$$E = -\sum_{k} t_k \log y_k, \quad (6)$$

where log represents the natural logarithm with base e, $y_k$ is the output of the neural network, and $t_k$ is the correct solution label.
Accuracy rate: The accuracy rate is used as the evaluation index of network performance and is calculated as

$$P = \frac{TP}{TP + FP}.$$

In the formula, P is the network accuracy rate, TP is the number of correctly classified images, and FP is the number of wrongly detected images. By calculating the above detection accuracy rate, the performance of different networks is compared and analysed.
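Both indices are simple to compute; a minimal sketch follows. The small constant `eps`, which guards against log(0), and the function names are assumptions added for illustration and are not part of the definitions above.

```python
import math


def cross_entropy(t, y, eps=1e-12):
    """Cross-entropy error E = -sum_k t_k * log(y_k) with the natural logarithm.

    t is the correct label vector (e.g. one-hot) and y the network output
    probabilities; eps is an assumed guard against log(0).
    """
    return -sum(t_k * math.log(y_k + eps) for t_k, y_k in zip(t, y))


def accuracy_rate(tp, fp):
    """Accuracy rate P = TP / (TP + FP)."""
    return tp / (tp + fp)


# One-hot label for the second of three classes and a fairly confident prediction.
print(cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]))
# 90 correctly classified images and 10 wrongly classified ones.
print(accuracy_rate(90, 10))
```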
C. Data Set Description
The open CK+, JAFFE, and FER2013 datasets were used for training and testing. All the datasets contain seven facial expressions: anger, neutral, disgust, fear, happiness, sadness, and surprise. The CK+ dataset was extended from the Cohn-Kanade database and contains 327 labelled facial videos. The images used in the experiment were extracted from the last three frames of each sequence, giving a total of 981 pictures with seven facial expressions. The JAFFE dataset contains 213 images of seven expressions. In this study, the face images are resized by bilinear interpolation, and the affine transformation is then used to rotate each image through five angles for data expansion. The FER2013 facial expression dataset contains 35,886 facial expression images and is a common dataset for facial expression recognition competitions. Facial expression samples from the three datasets are shown in Fig. 7.
The FER2013 data set contains many wrongly labelled samples, and some samples are not frontal face images or show occluded faces. The FER2013 data set is much larger and more difficult to recognize than the CK+ and JAFFE data sets. In the experiment, the recognition rate on the FER2013 private test set was taken as the model recognition accuracy, and the final sample grouping of the three data sets is shown in Table II.

A. Comparison of Different Models
For the network models, ResNet18, ResNet50, ResNet50 + convolutional block attention module (CBAM), and CNN-RNN were selected for comparison, and performance was measured by the recognition rate and the number of model parameters. The experimental results on the three data sets are shown in Table III. Compared with ResNet18, its improved variants, and the separate deep learning models, the Fuzzy optimized CNN-RNN model achieves improved expression recognition, with recognition rates on the CK+, JAFFE, and FER2013 data sets of 99.22 %, 96.64 %, and 72.81 %, respectively. This is because, by introducing the RNN module, the Fuzzy optimized CNN-RNN makes full use of the convolution features and avoids the loss of context information. Combined with the expanded data set, the detection accuracy is improved to a certain extent.
By deepening the network, the ResNet model improves recognition performance to some extent, but the number of model parameters grows too much and the gains diminish. ResNet50 has 13.87M more parameters than ResNet18, yet its recognition rates improve by only 1 %, 1 %, and 2 %, whereas the Fuzzy optimized CNN-RNN constructed in this paper improves on ResNet50 by 6.29 %, 6.98 %, and 2.71 % with 2.99M fewer parameters. Compared with ResNet50 + CBAM, it improves the recognition rates by 2.25 %, 3.05 %, and 1.81 % with 1M fewer parameters, which demonstrates the feasibility of the proposed method.
Compared with the CNN and Fuzzy-RNN network models, recognition on CK+ and JAFFE improves somewhat, though not significantly, while recognition on the FER2013 database improves significantly. The increase in the number of parameters is not significant, which is consistent with the statement in [24] and further supports the effectiveness of the method presented in this paper.

B. Comparison of CNN-RNN with other Methods
The per-expression recognition rates of the Fuzzy optimized CNN-RNN on the CK+ and FER2013 data sets are compared with those of other methods in Table IV and Table V. As can be seen from Table IV and Table V, the Fuzzy optimized CNN-RNN method has some deficiencies in judging fear expressions on CK+ and achieves good recognition on the other six expressions. On the FER2013 data set, happiness, surprise, and disgust are recognized well, while there are some discrimination errors among anger, fear, and sadness because these three expressions are similar; in addition, neutral and sad expressions show little change in facial features, which increases the difficulty of recognition. In general, compared with other recent literature, the Fuzzy optimized CNN-RNN method achieves a better recognition rate, which demonstrates the effectiveness of the Fuzzy optimized CNN-RNN structure for facial expression recognition. As can be seen from Table IV, the recognition rate of the Fuzzy optimized CNN-RNN method on the CK+ database is 2.71 % higher than that of the method proposed by Gan, Chen, Yang, and Xu [51], which embeds a single-pooled channel attention module in the convolutional layer. The feature fusion reclassification network model proposed by Zhang, Huang, and Tian [53], which extracts the original image and the Local Binary Pattern (LBP) feature image through two VGG network channels respectively, achieved better recognition performance. The attention-based method and the method presented in this paper still maintain similar recognition results.
As can be seen from Table V, the overall recognition rate of the Fuzzy optimized CNN-RNN method on the FER2013 dataset is 1.22 % higher than that of the convolutional neural network with enhanced preprocessing proposed by Khemakhem and Ltifi [50], and 0.7 % higher than that of the method of Liu and Zhou [52], which adds a curriculum learning strategy to facial expression recognition training. Both methods enhance feature learning at different stages of facial expression recognition. The bilinear interpolation and the attention mechanism were used to enhance the key features of the images with more datasets than the traditional linear interpolation. As a result, the residual network integration in the present model can avoid the network depth degradation problem, which gives it an advantage over the RARNet model.
Lastly, we would like to mention possible applications of the proposed facial expression recognition method. The first is in smartphones; the second is in smart home equipment. There are also many other applications of facial expression recognition systems, including social robots, medical treatment, driver fatigue monitoring, and human-computer interaction systems.

V. CONCLUSIONS
In this paper, a Fuzzy optimized CNN-RNN method is proposed for facial expression recognition. The recognition performance of the new method is compared with that of many popular existing methods on major test datasets. The comparison demonstrates that the proposed method improves the recognition rate on different facial expression datasets and performs better than existing CNN-based methods. The maximum improvement reaches 3.4 %, which indicates the advantage of the proposed Fuzzy optimized CNN-RNN method. However, owing to mislabelled, non-frontal, occluded, and low-quality expression images, there is still considerable room for improvement in facial expression recognition. The next step will be to further study attention mechanism algorithms and fuzzy-based uncertainty reduction.

CONFLICTS OF INTEREST
The authors declare that they have no conflicts of interest.