Binary Quantization Analysis of Neural Network Weights on the MNIST Dataset

Abstract—This paper considers the design of a binary scalar quantizer for a Laplacian source and its application in compressed neural networks. The quantizer performance is investigated over a wide dynamic range of data variances, and for that purpose we derive novel closed-form expressions. Moreover, we propose two criteria for selecting the quantizer for the variance range of interest. Binary quantizers are further implemented for compressing neural network weights, and their performance is analysed on a simple classification task. Good agreement between theory and experiment is observed, indicating strong potential for practical implementation.


I. INTRODUCTION
Artificial neural networks (NNs) have become an attractive research field in recent decades for resolving various challenges, owing to the increasing availability of powerful hardware [1]. The most significant achievements have been obtained in tasks such as image classification [2], object recognition [3], and speech processing [4]. NNs have also been applied in other fields, with promising results [5]-[7].
Specifically, improved performance (i.e., a high prediction accuracy level) has often been achieved using very complex NN architectures with large numbers of parameters and substantial computational and storage requirements. This in turn can be a limiting factor for the application of NNs in portable and edge computing devices with limited memory and processing power, or in latency-critical services. Hence, the need for NN compression is evident, and quantization is a widely used approach for that purpose. In that case, NN parameters (weights, activations, etc.), usually represented in 32-bit floating-point format (full precision), are mapped to fixed-point representations with lower bit lengths.
The compression of NN parameters and its various challenges have been studied in many research papers, using codewords whose lengths are 8 bits [8], 4 bits [9], or 2 bits [10]. In addition, even lower representations using ternary [11] and binary [12]-[17] quantization have been considered, where a significant compression ratio accompanied by a competitive accuracy level has been offered by the quantized NN. Hence, binary quantization plays an important role in the compression of NN parameters and deserves a detailed treatment from the perspectives of both signal processing and NN performance. Such an analysis is provided in this paper and has not been conducted in previous works [12]-[16]. In contrast to the recently published paper [17], where a comprehensive analysis of the binary quantizer, including adaptation and application, is provided, here we deal with the design of a fixed (non-adaptive) quantizer for a wide dynamic range. This is important because the same quantizer can then be used for different Laplacian inputs. Note also that the Laplacian distribution describes NN weights well [9], as well as speech [18]-[20]. In particular, we derive closed-form expressions for performance estimation and introduce two criteria for selecting the quantizer for the variance range of interest. The theoretical results are further verified on real data using the weights of an NN trained for the handwritten digit classification problem. The influence of binarized weights on prediction accuracy is also investigated, and the relation between weight quality (measured by SQNR) and accuracy is examined, which has not been done so far.
The rest of the paper is organized as follows. In Section II, the design method for the optimal quantizer with respect to distortion is given. In Section III, the analysis in a wide dynamic range is provided in detail and criteria for selecting the quantizer for the defined variance range are proposed. In Section IV, the experimental results obtained by implementation in neural networks are summarized and discussed. Finally, we conclude the paper in Section V.

II. DESIGN OF BINARY QUANTIZER FOR THE REFERENCE VARIANCE
Let us consider a symmetrical binary (N = 2 levels) scalar quantizer, presented in Fig. 1. Let α denote the representation level in the positive range (the level in the negative range is simply the reflection of the positive one), and let x_max denote the maximal data limit, where α = x_max/2. Let the input data source be described by the Laplacian probability density function (PDF) given by [18]

p(x) = (1/(σ√2)) exp(−√2|x|/σ),    (1)

where σ² is the variance of the data. If we adopt the unit variance as the reference one (σ² = σ_0² = 1), which is the standard approach in scalar quantization [18], then the PDF takes the form

p(x) = (1/√2) exp(−√2|x|).

For this input data source, the distortion of the symmetrical binary quantizer can be evaluated as

D = 2 ∫_0^∞ (x − α)² p(x) dx = 1 − √2 α + α²,

with the corresponding signal-to-quantization-noise ratio SQNR = 10 log_10(σ_0²/D). Figure 2 shows how x_max affects the SQNR. The commonly used criterion for quantizer design is maximal SQNR (or, equivalently, minimal distortion) [18]. According to the presented results, this criterion is met for x_max = √2 ≈ 1.4142 (α = 0.7071). This can also be verified using the following lemma.

Lemma 1. The value of x_max of the Laplacian binary quantizer optimized in terms of distortion is x_max = √2.

Proof. Expressing D in terms of x_max = 2α, taking the first derivative of the distortion with respect to x_max, and equating it to zero results in

dD/dx_max = −1/√2 + x_max/2 = 0.
From the last equation we obtain x_max = √2, which concludes the proof. Based on Lemma 1 and the relation among the quantizer parameters, it holds that α_opt = 1/√2. Note also that for x_max = 2 (i.e., α = 1) we obtain the quantizer widely used in NN applications [12]−[16], which provides 0.7 dB lower SQNR than the optimal one.
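As a quick numerical check of Lemma 1, the closed-form distortion above can be evaluated directly. The sketch below is an illustration with hypothetical helper names (not code from the paper); it compares the optimal level α = 1/√2 with the widely used choice α = 1:

```python
import numpy as np

def distortion(alpha, sigma=1.0):
    # D = sigma^2 - sqrt(2)*alpha*sigma + alpha^2 for the symmetric
    # binary quantizer of a zero-mean Laplacian source (Section II).
    return sigma**2 - np.sqrt(2) * alpha * sigma + alpha**2

def sqnr_db(alpha, sigma=1.0):
    # SQNR = 10 log10(sigma^2 / D), in dB.
    return 10 * np.log10(sigma**2 / distortion(alpha, sigma))

alpha_opt = 1 / np.sqrt(2)   # Lemma 1: x_max = sqrt(2), alpha = x_max/2
print(sqnr_db(alpha_opt))    # ~3.01 dB (the optimum)
print(sqnr_db(1.0))          # alpha = 1 (x_max = 2): ~2.32 dB, i.e., ~0.7 dB lower
```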

III. DESIGN OF BINARY SCALAR QUANTIZER FOR A WIDE DYNAMIC RANGE
In this section, we consider the situation in which a binary quantizer (designed for a particular variance) is applied to Laplacian inputs whose variance differs from the design variance. This is known as variance-mismatched quantization [18], [21]. It is well known that the variance-mismatch effect reduces the efficiency of the quantization model over a broad variance range. Hence, robust quantization models are recommended for non-stationary data processing, as they can satisfy minimal quality requirements over the entire range. Here, we analyse the binary quantizer over a wide dynamic range and derive expressions for performance evaluation. In addition, criteria for the selection of a binary quantizer for the established variance range of interest are proposed.

A. Derivation of Expressions for Performance Evaluation
To evaluate the performance of the binary quantizer over a wide dynamic range of input data variances, we use the PDF defined in (1). Hence, we estimate the distortion as

D(σ) = σ² − √2 α(σ_0) σ + α²(σ_0),    (9)

where α(σ_0) = α and x_max(σ_0) = x_max denote the values of the representation level and the maximal data limit in the case of variance σ_0² (see Section II). Considering the previous expressions, the SQNR is given by

SQNR(σ) = 10 log_10(σ²/D(σ)).    (10)

Figure 3 plots SQNR (10) in the variance range (−10 dB, 25 dB) with respect to σ_0² = 1, for x_max = 1/2, x_max = 1, x_max = √2, x_max = 2, and x_max = 4. It can be observed that all SQNR curves attain the same maximum (equal to that of the optimal quantizer in Section II), but the SQNR does not retain a constant value over the rest of the variance range; instead, it decreases rapidly. Accordingly, the quantizer robustness is low and the efficiency on non-stationary data is limited. It is also important to discuss the impact of the parameter x_max on the design approaches presented here (wide dynamic range) and in Section II (particular variance). While in Section II the selection of a non-optimal x_max value (x_max ≠ √2) causes a degradation in SQNR (see Fig. 2), here it shifts the curve left or right relative to the one with the optimal value x_max = √2. Note also that each SQNR curve attains its maximum at a different variance point, as defined by the following lemma.
Lemma 2. Given the variance range and the parameter x_max, the binary quantizer attains the maximum SQNR at the point specified in (14) below.

Proof. Let us define the function

F(σ) = σ²/D(σ) = σ²/(σ² − √2 α σ + α²).    (12)

Taking the first derivative of F with respect to σ and equating it to zero results in

σ = x_max/√2,    (14)

or, in terms of α, σ = √2 α, which concludes the proof. By replacing (14) in (12) and taking the logarithm (base 10), we obtain SQNR = 10 log_10 2 = 3.01 dB, the same as in Section II.
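The mismatch expressions (9) and (10) together with Lemma 2 can be checked numerically. The following sketch (illustrative code with assumed names, not from the paper) evaluates SQNR(σ) for several x_max values and confirms that each curve peaks at σ = x_max/√2 with the same 3.01 dB maximum:

```python
import numpy as np

def sqnr_mismatch_db(sigma, x_max):
    # SQNR (10): a quantizer designed for sigma_0 = 1 (level alpha = x_max/2)
    # applied to a Laplacian input with standard deviation sigma.
    alpha = x_max / 2
    d = sigma**2 - np.sqrt(2) * alpha * sigma + alpha**2   # distortion (9)
    return 10 * np.log10(sigma**2 / d)

for x_max in (0.5, 1.0, np.sqrt(2), 2.0, 4.0):
    sigma_peak = x_max / np.sqrt(2)                        # Lemma 2, eq. (14)
    print(x_max, sqnr_mismatch_db(sigma_peak, x_max))      # each peak ~3.01 dB
```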
In addition, we show that the SQNR and the distortion attain their extreme values (maximum and minimum, respectively) at different variance points. Figure 4 shows the distortion (9) as a function of the variance; its minimum is attained at the point

σ = α/√2 = x_max/(2√2),    (16)

which is different from the one defined in (14). The corresponding values of σ for both optimization functions, SQNR and D, for different x_max, are given in Table I.
In the following subsection, we provide criteria for selecting the best quantizer (i.e., the appropriate x_max value), either for a particular variance or for a range of variances narrower than 35 dB.

B. Criteria for Selection of the Binary Quantizer
Firstly, we consider the scenario in which the best quantizer from a set of quantizers (i.e., ones with different x_max) needs to be selected for a particular variance in the defined variance range, with SQNR as the performance criterion. By direct inspection of Fig. 3, for the variance at the point 0 dB (σ = 1 in the linear domain) the best quantizer is the one with x_max = √2, achieving an SQNR of 3.01 dB (as indicated in Section II). On the other hand, for the variance points of, e.g., 15 dB (σ = 5.62) and 20 dB (σ = 10), the binary quantizer with x_max = 4 is the best, since it provides SQNRs of nearly 3 dB and 1.25 dB, respectively, and outperforms the other observed quantizers.
The following two criteria are proposed for selecting the best binary quantizer for the variance range of interest.
The first criterion selects the quantizer that achieves the maximal average SQNR (SQNR_av) in a defined (fixed) variance range:

SQNR_av = (1/m) Σ_{i=1}^{m} SQNR(σ_i),    (17)

where m is the number of observed variance points σ_i taken from that fixed range. With the second criterion, we want to emphasize the importance of robustness. Thus, besides taking SQNR_av into account, the best quantizer has to fulfil one additional condition in the given range:

SQNR(σ) ≥ SQNR_min = 1 dB,    (18)

where SQNR_min defines the minimal SQNR that should be achieved in the desired range. In other words, if the quantizers achieve very close SQNR_av values in the defined range, then the quantizer providing the widest interval in which criterion (18) is fulfilled is chosen as the best.
From the theoretical SQNR curves in Fig. 3, it can be shown that the width of the range in which condition (18) is fulfilled is the same for every curve and amounts to approximately 17.6 dB. Furthermore, the borders of that range, denoted by (σ_min, σ_max) for each curve (i.e., for each x_max), can be calculated as the solutions of

10 log_10(σ²/(σ² − √2 α σ + α²)) = 1 dB,    (19)

and are provided in Table II. In Tables III and IV, the SQNR_av values and the widths of the intervals in which (18) is fulfilled are given for the considered variance ranges, providing the basis for the application of both criteria. The second criterion matches the first, as the same quantizers are chosen for the established variance ranges.
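Both selection criteria are straightforward to evaluate from the closed-form SQNR. The sketch below (an illustrative implementation with an assumed variance grid, not the authors' code) computes the average SQNR (17) and the width of the sub-range satisfying (18) over an example variance range:

```python
import numpy as np

def sqnr_db(sigma, x_max):
    # SQNR (10) for a quantizer designed at sigma_0 = 1 with alpha = x_max/2.
    alpha = x_max / 2
    return 10 * np.log10(sigma**2 / (sigma**2 - np.sqrt(2) * alpha * sigma + alpha**2))

# Criterion 1: average SQNR (17) over variance points in the range of interest.
# Criterion 2: width of the sub-range where SQNR >= 1 dB, condition (18).
var_db = np.linspace(0, 18, 181)        # example range [0 dB, 18 dB], 0.1 dB grid
sigmas = 10**(var_db / 20)              # variance in dB -> standard deviation

for x_max in (np.sqrt(2), 2.0, 4.0):
    s = sqnr_db(sigmas, x_max)
    avg = s.mean()                      # criterion (17)
    robust = var_db[s >= 1.0]           # points fulfilling (18)
    width = robust[-1] - robust[0] if robust.size else 0.0
    print(x_max, avg, width)
```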

IV. EXPERIMENTAL RESULTS AND DISCUSSION
The goal of this section is to verify the theoretical analysis of Section III by applying a binary quantizer to the weights of an NN. In addition, we investigate the influence of binarized weights on NN performance, measured by prediction accuracy [1].
Our experiment focuses on a feedforward neural network known as the multilayer perceptron (MLP) [1]. This is a classical network composed of input, hidden, and output layers. We use the MNIST database [22] as input, which contains 60,000 monochrome images of handwritten single digits of dimension 28×28 pixels, of which 50,000 and 10,000 images are used for training and testing, respectively. The MLP is used for the classification task; the numbers of nodes in the input, hidden, and output layers are 784 (28×28), 128, and 10 (the number of classes), respectively. Rectified Linear Unit (ReLU) and softmax activation functions are used in the hidden and output layers, respectively. In addition, the regularization rate, learning rate, number of iterations per epoch, and batch size are set to 0.01, 0.0005, 468, and 128, respectively.
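For readers wishing to reproduce the architecture, a minimal forward-pass sketch of the described 784-128-10 MLP is given below. The weight values here are random placeholders (the actual values come from the training described in the paper); the code only illustrates the layer shapes and activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights with the shapes of the described MLP (784-128-10);
# in the experiment these are learned by training on MNIST.
W1, b1 = rng.normal(0, 0.05, (784, 128)), np.zeros(128)
W2, b2 = rng.normal(0, 0.05, (128, 10)), np.zeros(10)

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)             # ReLU hidden layer
    z = h @ W2 + b2
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # softmax over 10 classes

x = rng.random((1, 784))                         # one flattened 28x28 image
p = forward(x)
print(p.shape)                                   # (1, 10); rows sum to 1
```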
The MLP is trained for 20 epochs, achieving a prediction accuracy of 96.7 % (with the weights represented in 32-bit floating-point, i.e., full-precision, format). The histogram of the learned weights is depicted in Fig. 5. From the figure, one can note that the weights can be approximated by a Laplacian PDF with variance σ_w² and mean μ_w (in our case, μ_w tends to zero), providing the basis for implementation of the considered binary quantizer (post-training quantization is performed).
The efficiency of the quantizer on real data is measured using SQNR_ex, which (assuming zero mean) is defined as

SQNR_ex = 10 log_10(σ_w²/D_w), with D_w = (1/W) Σ_{i=1}^{W} (w_i − w_i^q)²,

where D_w is the distortion obtained by binarization of the weights, W is the total number of weights, w_i is the original value, and w_i^q is the quantized value of weight i.
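The SQNR_ex computation can be sketched on synthetic Laplacian "weights" (random data standing in for the trained weights, which are not reproduced here); for x_max = √2 the empirical value should approach the theoretical 3.01 dB:

```python
import numpy as np

def binarize(w, x_max, sigma_w):
    # Map each weight to +-alpha, with alpha scaled by the weights' std. dev.
    alpha = x_max / 2 * sigma_w
    return np.where(w >= 0, alpha, -alpha)

def sqnr_ex_db(w, wq):
    d_w = np.mean((w - wq)**2)                 # distortion D_w over W weights
    return 10 * np.log10(np.mean(w**2) / d_w)  # zero-mean assumption

# Synthetic unit-variance Laplacian data (scale b gives variance 2*b^2).
w = np.random.default_rng(1).laplace(0, 1 / np.sqrt(2), 100_000)
wq = binarize(w, np.sqrt(2), w.std())
print(sqnr_ex_db(w, wq))                       # close to the theoretical 3.01 dB
```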
The correctness of the theoretical results (in terms of SQNR) is investigated over the range [0 dB, 18 dB] relative to the reference variance, which is set to that of the original weights (σ_w²). In Table V, the SQNR_ex values for selected points from this range are summarized for quantizers with different values of x_max; for illustration, Fig. 6 plots SQNR_ex versus the variance of the weights. The selection of the best quantizer based on the experimental results is done using the criteria proposed in Section III. Table VI lists, for each binary quantizer, the average SQNR_ex (SQNR_av^ex) and the width of the interval in which SQNR_ex exceeds 1 dB, achieved in the range under consideration. It can be seen that the first criterion selects the binary quantizer with x_max = 4, while according to the second criterion the best quantizer is the one designed with x_max = 2. Note also that the quantizer designed with x_max = 2 is the theoretical choice for both criteria for the considered range (see Tables III and IV), so the theoretical and experimental results match in that case.
In addition, the weights (the original weights, as well as weights whose variance differs from the original one) quantized using a binary quantizer with various x_max are then separately loaded into the MLP for classification on the test data (10,000 images from the MNIST database [22]), and the prediction accuracy is examined. This corresponds to the situation in which the same (non-adaptive) binary quantizer is used for different MLP networks (as the set of weights differs in each case). The accuracy scores are provided in Table V, from which some interesting conclusions can be drawn. Observe that, for a given binary quantizer defined by x_max, each MLP achieves the same prediction accuracy score, although different SQNRs are obtained. This is because the same quantized weights are produced regardless of the variance of the weights (the weights are quantized to the values −x_max(σ_w)/2 and x_max(σ_w)/2), and thus the quantized MLP is the same. Accordingly, in that case, the relationship between SQNR and prediction accuracy cannot be uniquely defined (i.e., SQNR does not dominantly determine neural network performance).
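The invariance described above follows from the fact that a fixed binary quantizer keeps only the sign of each weight. A minimal sketch (illustrative, with a hypothetical binary_quantize helper) demonstrating that inputs of different variance yield identical quantized weights:

```python
import numpy as np

def binary_quantize(w, x_max):
    # Fixed (non-adaptive) binary quantizer: only the sign of each weight
    # survives, so the output levels are +-x_max/2 irrespective of the input.
    return np.where(w >= 0, x_max / 2, -x_max / 2)

rng = np.random.default_rng(2)
w = rng.laplace(0, 1 / np.sqrt(2), 1000)       # "original" weights
for scale in (1.0, 2.0, 5.62):                 # weights with different variances
    same = np.array_equal(binary_quantize(w * scale, 2.0),
                          binary_quantize(w, 2.0))
    print(same)                                # True for every positive scale
```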
On the other hand, Table V shows that the accuracy of the quantized MLP increases as the binary quantizer uses higher values of x_max. This can be explained as follows: as x_max increases, the distance between the representation levels increases, enabling better classification and higher accuracy scores. The highest performance of the quantized MLP is achieved when the quantizer with x_max = 4 is applied (92.24 %), and slightly lower performance when the quantizer with x_max = 2 is applied (91.93 %). Note that these two quantizers were already identified as the most appropriate by the SQNR analysis above. Increasing the parameter x_max further (x_max > 4) results in a negligible increase in MLP performance.
Finally, one can notice that the MLP with binarized weights provides an accuracy score that is 4.46 % (x_max = 4) or 4.84 % (x_max = 2) lower than that achieved with full-precision weights, while at the same time reducing the network size by a factor of 32.

V. CONCLUSIONS
In this paper, a detailed analysis of binary scalar quantization of a Laplacian source has been carried out, along with an application to the compression of NN parameters. Closed-form expressions for SQNR and distortion have been derived for analysis over a wide dynamic range, and two criteria have been proposed for selecting the best quantizer. Verification of the theoretical results, in terms of the SQNR achieved over a wide dynamic range and the quantizer selection, has been performed on real data using NN weights. Furthermore, the selected fixed (non-adaptive) binary quantizer has been applied to compress different MLP networks (whose coefficients follow the Laplacian PDF but have different variances) with the goal of establishing the relationship between SQNR and prediction accuracy. It has been shown that each quantized MLP is the same regardless of the weight variance (i.e., the same prediction accuracy is achieved), although different SQNRs are observed for different weights. Therefore, a uniquely defined relationship could not be established, as SQNR does not dominantly determine NN performance. In addition, a relatively high prediction accuracy has been reported (over 92 %), only 4.46 % lower than that of the full-precision model, along with a compression gain of 32 times.

CONFLICTS OF INTEREST
The authors declare that they have no conflicts of interest.