Convolutional Neural Network Feature Reduction using Wavelet Transform

This paper describes a possible application of the wavelet transform to convolutional neural networks (CNN). As is well known, the wavelet transform gives a good signal representation in the time and frequency domains. This can be used for CNN input feature reduction, as well as for architecture simplification, by passing only part of the coefficients to the network. The result of this work is a set of experiments that makes it possible to find the most appropriate part of the coefficients. After feature reduction and architecture simplification, the achieved configuration classifies data almost ten times faster than the original one.


I. INTRODUCTION
Artificial neural networks have found application in a wide range of modern technologies such as image and voice recognition, natural language processing, speech synthesis, etc. However, although the multilayer perceptron (MLP) model can solve the majority of classification problems, it has some drawbacks. First of all, it does not take into account the two-dimensional spatial structure of the input data [1]-[7]. This is the main disadvantage for classifying images and other 2D data. The solution to this problem comes from the biological model of the visual cortex, where neurons are grouped into receptive fields. This allows the data to be analyzed spatially and its features to be defined. The convolutional neural network implements such a model [1].
Convolutional neural networks (CNN) have a different architecture from MLPs, which definitely complicates the learning process [3], [6]. The learning process directly depends on the feature count, which for an image is its height multiplied by its width in pixels. The author proposes to apply the 2D discrete wavelet transform (DWT) to the input data, thereby reducing the number of features while preserving the performance of the CNN. The next sections describe the CNN architecture and its properties.

II. CNN ARCHITECTURE AND PROPERTIES
A CNN consists of multiple alternating convolution and subsampling layers. Each layer includes several feature maps, which are connected to the maps of the previous layer through a set of small receptive fields. The general CNN architecture is shown in Fig. 1.
The input data is split into several feature maps. This is performed by using different 2D kernels and the convolution operation. The result of the convolution defines features of the input data, such as corners, curves, lines, etc. A feature occurrence is mirrored on the feature map at the location of the corresponding receptive field. After each convolution layer follows a subsampling layer. It reduces the data for further analysis by reducing each map dimension four times. Every subsampling layer is in turn followed by a convolution layer. The connections between the layers are defined by an interconnection matrix.
This process repeats until the feature maps become too small, e.g. 1x1. After this, the feature maps are followed by a fully interconnected MLP whose output is the classifier vector.
The idea of the subsampling layer is shown in Fig. 2. Here $W_1$ and $W_2$ are 2D kernels whose coefficients are tuned in the CNN learning process; the kernel size and number can vary and depend on the CNN developer's preferences at the design stage. CNN signal forward propagation can be described by the following set of equations, for which matrix notation is helpful.
Suppose the input data is represented by the matrix $X$, and $W_{nm}$ is a kernel matrix, where $m$ is the kernel number in the $n$-th convolution layer. The result of the convolution operation is a set of feature maps $M_{nm}$, where $m$ is the feature map number of the $n$-th convolution layer. For the first layer

$M_{1m}(k,l) = \sum_{i}\sum_{j} X(k+i,\, l+j)\, W_{1m}(i,j)$, (1)

where $k, l$ index the features of the $m$-th map.
Once the feature maps are formed, it is necessary to apply an activation function to each feature $x$, for example the hyperbolic tangent

$f(x) = \tanh(x)$. (2)

The activation function limits the feature value to the range from -1 to +1, which is necessary for stable CNN performance and the upcoming computations.
After each convolution layer follows a subsampling layer, which reduces the map dimension four times. Averaging over 2x2 blocks, this can be expressed by

$S_{nm}(k,l) = \frac{1}{4}\sum_{i=0}^{1}\sum_{j=0}^{1} M_{nm}(2k+i,\, 2l+j)$. (3)

Using the above notation, (1) can be rewritten for each subsequent convolution layer:

$M_{(n+1)m}(k,l) = \sum_{q}\sum_{i}\sum_{j} S_{nq}(k+i,\, l+j)\, W_{(n+1)m}(i,j)$, (4)

where the sum over $q$ runs over the previous-layer maps connected to map $m$ by the interconnection matrix. The computation of (3) and (4) repeats until the feature maps become 1x1. Then the row of features is connected to a fully interconnected MLP whose output corresponds to the classifier.
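To make the forward pass concrete, the following is a minimal Python/NumPy sketch of one convolution-plus-subsampling stage. The tanh activation, the 2x2 averaging, and all function names are illustrative assumptions consistent with (1)-(3), not the paper's exact implementation:

```python
import numpy as np

def conv2d_valid(x, w):
    """'Valid' 2D convolution of map x with kernel w, as in (1)."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for k in range(out.shape[0]):
        for l in range(out.shape[1]):
            out[k, l] = np.sum(x[k:k + kh, l:l + kw] * w)
    return out

def subsample(m):
    """2x2 averaging that reduces the map dimension four times, as in (3)."""
    return 0.25 * (m[0::2, 0::2] + m[0::2, 1::2] +
                   m[1::2, 0::2] + m[1::2, 1::2])

def cnn_stage(x, kernels):
    """One convolution layer (1), activation (2), and subsampling layer (3)."""
    return [subsample(np.tanh(conv2d_valid(x, w))) for w in kernels]

# A 28x28 sample and six 5x5 kernels, as in the first LeNet5 stage
x = np.random.rand(28, 28)
maps = cnn_stage(x, [np.random.randn(5, 5) for _ in range(6)])
print(maps[0].shape)  # (12, 12): (28 - 5 + 1) // 2
```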
Such a neural network architecture gives additional advantages over the MLP. It is tolerant of input data variance, which images are often exposed to. The convolution layers give immunity to data shifts and deformations, while the subsampling layers make it possible to disregard the data scale. These advantages are exploited for detecting and recognizing handwritten symbols, which are often shifted and deformed [1], [5], [7].
Although the introduced architecture is quite complicated compared to the MLP, it has fewer trainable parameters, or weights, for the same performance. This is achieved by using shared weights, which are represented by the kernel matrices.
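To illustrate the saving, consider an assumed example with the 5x5 kernels typical for LeNet5 [1]: a convolution layer producing six feature maps from a 28x28 input holds only 6·(5·5 + 1) = 156 trainable parameters thanks to weight sharing, whereas a fully connected layer mapping the same 784 inputs to the same 6·24·24 = 3456 outputs would need about 2.7 million weights.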

III. APPLYING THE WAVELET TRANSFORM TO CNN
As can be expected, data is sometimes redundant. To address this problem in a convenient way while preserving the data behaviour, the wavelet transform can be used. The wavelet transform has also been successfully used in image compression algorithms such as JPEG2000 [2].
In the CNN case the features are the data, and the wavelet transform is useful for their reduction. For this purpose the wavelet with the Haar basis is the most suitable, as it has a simple set of filters [4], expressed by the low-pass filter $H$ and the high-pass filter $G$:

$H = \frac{1}{\sqrt{2}}\,[\,1 \ \ 1\,], \quad G = \frac{1}{\sqrt{2}}\,[\,1 \ \ {-1}\,]$. (5)

To transform the features into coefficients, (5) is applied to the input data $X$ along its rows and columns:

$A = H X H^{T}, \quad D_h = H X G^{T}$, (6)

$D_v = G X H^{T}, \quad D_d = G X G^{T}$, (7)

where $A$ contains the data approximation coefficients, $D_h$ the horizontal detail coefficients, $D_v$ the vertical detail coefficients, and $D_d$ the diagonal detail coefficients [4]. Therefore, after applying the transform, only part of the coefficients needs to be passed to the CNN. This reduces not only the number of features but, as a result, also simplifies the CNN architecture. Any of the submatrices $A$, $D_v$, $D_h$, $D_d$ may be used as input features, but, as the experiments will show, the most appropriate results, with the same misclassification rate as the original, are given by the approximation matrix $A$.
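Below is a minimal NumPy sketch of the single-level transform (5)-(7); the explicit matrix forms of $H$ and $G$ and the assumption of a square input with even size are illustrative:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT in the matrix form of (6)-(7).

    x is a square 2D array of even size n; each output is n/2 x n/2.
    """
    n = x.shape[0]
    H = np.zeros((n // 2, n))  # low-pass: averages neighbouring pairs
    G = np.zeros((n // 2, n))  # high-pass: differences neighbouring pairs
    for k in range(n // 2):
        H[k, 2 * k] = H[k, 2 * k + 1] = 1.0 / np.sqrt(2.0)
        G[k, 2 * k], G[k, 2 * k + 1] = 1.0 / np.sqrt(2.0), -1.0 / np.sqrt(2.0)
    A  = H @ x @ H.T   # approximation: the smoothed copy passed to the CNN
    Dh = H @ x @ G.T   # horizontal details
    Dv = G @ x @ H.T   # vertical details
    Dd = G @ x @ G.T   # diagonal details
    return A, Dh, Dv, Dd

A, Dh, Dv, Dd = haar_dwt2(np.random.rand(28, 28))
print(A.shape)  # (14, 14): a quarter of the original 28x28 features
```

For a 28x28 sample each submatrix is 14x14, so passing only $A$ to the network quarters the number of input features, which is exactly the reduction exploited in Section IV.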

IV. USING FEATURES REDUCTION FOR CNN FASTER PERFORMANCE
This section applies the described method to a real example of CNN use. For the experiments the LeNet5 CNN architecture is used [1]. This architecture has been successfully applied to handwritten digit recognition.
To implement the method it is necessary to train the CNN with a training set. This set is formed from a couple of fonts that include Latvian characters; noisy samples are included as well. Together with the training set, a test set is used. The test set includes fonts that are not used in the training samples, which makes it possible to check CNN performance at the learning stage.
For reference, the CNN was trained on the usual 28x28 pixel samples. The learning plot and performance are shown in Fig. 3. The Root Mean Square Error (RMSE) for this case is 1.25 (Fig. 3(b)) and the misclassification rate is 8.33% of the test set. These parameters are measured after the learning curve converges. The RMSE is computed over the output classifier vector, while the misclassification rate shows the fraction of wrongly classified symbols in the test set.
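The paper does not state the exact error formula; a standard definition consistent with the text, where $\mathbf{y}_i$ is the output classifier vector for the $i$-th test sample, $\mathbf{t}_i$ is its target vector, and $N$ is the test set size, would be $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\lVert \mathbf{y}_i - \mathbf{t}_i \rVert^{2}}$.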
For this CNN configuration, the data forward propagation time for each symbol is about 250 ms, and the learning process takes 1 h 55 min. These times are used as the reference; note, however, that they depend on the available computing power. After the reference computation is performed, a set of experiments is carried out.
In the first experiment, (5) is applied to each of the symbols, and the CNN is trained on the transformed samples. As can be noticed in Fig. 4(b), the performance becomes even worse. This can be explained by sample feature redundancy, which makes the CNN unstable. In this case neither the learning time nor the forward propagation time changes.
This suggests a new approach to using the transformed data. As seen in (6) and (7), the transformed data consists of four parts, which include the approximation and detail coefficients. It is appropriate to use one of these parts as the CNN input features, but the CNN architecture must be changed for such samples. If the input data is a quarter of the original samples, the input layer should accept a 14x14 data structure. Therefore the CNN input layer size is decreased four times. As a result, the number of internal CNN layers decreases as well, because the feature maps degrade (reach their minimum size) earlier, as the sketch below shows.
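The earlier degradation of the feature maps is simple arithmetic; this sketch assumes 5x5 kernels with 'valid' convolution and 2x2 subsampling throughout (the actual kernel sizes are a design choice, see Section II):

```python
def stage(n, kernel=5):
    """Map size after one stage: 'valid' convolution, then 2x2 subsampling."""
    return (n - kernel + 1) // 2

# Original 28x28 input:   28 -> 12 -> 4 (too small for another 5x5 kernel)
print(stage(28), stage(stage(28)))
# 14x14 approximation A:  14 -> 5, and a final 5x5 convolution gives 1x1,
# so the reduced network needs one stage less before the MLP part
print(stage(14))
```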
After performing a series of experiments it was found that the most suitable part for the CNN input is the approximation coefficient part $A$, which gives better performance than the others. This also matches the intuition of using a smoothed copy of the original samples. For the given CNN configuration the learning time was 8 min 16 s and the data forward propagation time decreased to 27 ms, which is dramatically smaller compared to the original LeNet5. However, the test set RMSE and misclassification rate are 1.5 and 16.7% respectively. This is the limit for such an architecture configuration, and it is caused by the CNN reduction and the low number of internal feature maps.
To obtain better performance, the map count of the first convolution layer of the original LeNet5 was extended from 6 to 12. After the learning process completed, this gives a new performance plot, shown in the next figure.
The last CNN architecture configuration gives the same misclassification rate as the first one, but the learning time becomes 9 min instead of 1 h 55 min, and the data forward propagation time drops from 250 ms to 32 ms. These results are quite promising and show a suitable application of the wavelet transform to CNN.

V. CONCLUSIONS
The achieved results of this work show the advantage of applying the wavelet transform for neural network feature reduction, which in turn allows the architecture to be reduced. This results in lower data forward propagation latency, and the learning time becomes shorter as well. In this particular example the wavelet transform is applied to a convolutional neural network. The following statements summarize the actual results of the work: 1) The original LeNet5 learning time and data forward propagation latency were 1 h 55 min and 250 ms respectively.
2) For the modified LeNet5 architecture, with the wavelet transform applied to the input data, the learning time and data forward propagation latency become 9 min and 32 ms respectively.
Both configurations have the same misclassification rate, which makes the second one preferable.