Modified Backpropagation Algorithm with Multiplicative Calculus in Neural Networks

Abstract: Backpropagation (BP) is one of the most widely used


I. INTRODUCTION
The backpropagation (BP) algorithm was first proposed by Werbos in his doctoral dissertation in 1974 [1]. Because the thesis presented the algorithm in a general context, it was not widely known or publicised in the neural network (NN) community until the mid-1980s. The algorithm was then independently presented by Rumelhart, Hinton, and Williams [2], Parker [3], and LeCun [4], and it was popularised by Rumelhart and McClelland [5]. After the publication of that study, scientific work in the field of NNs grew explosively, and most multilayer NNs have since been trained with the BP algorithm.
Although the BP algorithm has been applied successfully in many areas, it has some drawbacks: it needs a lot of training time, so convergence tends to be significantly slow [6], and the learning process can become stuck at a local minimum [7]. To address the slow convergence rate and the local minima problem, this paper introduces a modified BP algorithm with multiplicative calculus. To my knowledge, this is the first study on the use and application of multiplicative calculus in this field.
Manuscript received 15 February, 2023; accepted 10 May, 2023.
The proposed algorithm uses the multiplicative derivative of the activation function in the NN architecture. It is a modified BP algorithm that still performs a backward pass while adjusting the learnable parameters (weights and biases). It differs from the classical algorithm in that the multiplicative form of the derivative is used in the computations, and this new derivative form is applied only in the last layer, not in the hidden layers.
Several linearly inseparable problems are studied with shallow NN models trained by the proposed algorithm. The performance of the algorithm is analysed in different case studies, and the results are compared with those of the classical BP algorithm. In these experiments, the proposed modified BP algorithm converges to the optimal solution faster than the classical one. In addition, the initial values of the adjustable parameters are set so that the classical BP algorithm is exposed to a potential local minimum; when the proposed algorithm is run on the same network, it avoids the local minima problem.
The rest of the paper is presented as follows. Section II gives an overview of the classical BP algorithm. In Section III, the multiplicative calculus description is introduced. Section IV describes the proposed algorithm, and the performance analyses are discussed in Section V. Finally, Section VI provides the conclusions of the paper.

II. AN OVERVIEW OF BACKPROPAGATION
In feedforward NNs, when an input x from the training data set is presented to produce an output y, the information flows forward through the hidden units in each layer; this process is called "forward propagation". During training, forward propagation continues until it produces a cost (or error). The backpropagation (BP) algorithm lets the information from the cost flow backwards through the network to compute the gradient [8]. The BP algorithm for training a feedforward NN is summarised in Algorithm 1 [9].
As illustrated, the BP algorithm requires differentiable activation functions to carry out gradient computations. To compute the gradient at a specific layer, the gradients of all layers are combined via the chain rule of calculus.
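The chain-rule computation described above can be sketched for a small network. This is a minimal illustration, not a reproduction of Algorithm 1: it assumes a 2-2-1 sigmoid network, a squared-error cost for a single training pair, and plain gradient descent. The function names (`forward`, `backward`, `sgd_step`) are hypothetical and chosen only for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass of a 2-2-1 sigmoid network; returns hidden and output activations."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def loss(params, x, t):
    """Squared-error cost for one training pair (an assumption of this sketch)."""
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2


def backward(params, x, t):
    """Classical BP: apply the chain rule from the cost back to every weight and bias."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    delta_out = (y - t) * y * (1.0 - y)  # dC/dz at the output node
    grad_W2 = [delta_out * hi for hi in h]
    grad_b2 = delta_out
    # Propagate the output error back through W2 and the hidden sigmoids.
    delta_hid = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hid]
    grad_b1 = delta_hid
    return grad_W1, grad_b1, grad_W2, grad_b2


def sgd_step(params, grads, lr):
    """One gradient-descent update of all learnable parameters."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    W1 = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, gW1)]
    b1 = [b - lr * g for b, g in zip(b1, gb1)]
    W2 = [w - lr * g for w, g in zip(W2, gW2)]
    b2 = b2 - lr * gb2
    return W1, b1, W2, b2
```

A finite-difference comparison against `loss` is a simple way to check that `backward` implements the chain rule correctly before using it in a training loop.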

III. MULTIPLICATIVE CALCULUS
There are two main tools in calculus: the derivative and the integral. The process of finding a derivative is called "differentiation" and the process of finding an integral is called "integration". These two operations are the fundamental operations of calculus. Differentiation and integration are infinitesimal versions of the subtraction and addition operations on numbers, respectively. The close connection between the derivative and the integral was first observed independently by Newton and Leibniz in the second half of the 17th century [10].
It is known that many physical quantities in nature vary exponentially. As an alternative to classical calculus, multiplicative calculus was proposed for exponentially varying functions [11]. Multiplicative calculus introduces new kinds of derivative and integral based on division and multiplication rather than subtraction and addition. This alternative calculus is described extensively in [12]-[14], with its use on real-valued functions in [15] and its extension to complex-valued functions in [11], [16].
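The multiplicative derivative of a function $f: \mathbb{R} \to \mathbb{R}^{+}$ is defined as
$$f^{*}(x) = \lim_{h \to 0} \left( \frac{f(x+h)}{f(x)} \right)^{1/h}. \tag{1}$$
Similarly, the multiplicative integral of a function $f$ on an interval $[a, b]$ is defined as
$$\int_{a}^{b} f(x)^{dx} = \exp\left( \int_{a}^{b} \ln(f(x))\, dx \right). \tag{2}$$
In these definitions, $\mathbb{R}^{+}$ indicates the positive real numbers and $\ln(f(x))$ represents the natural logarithm of the given function. It is known that the classical derivative is defined as
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}. \tag{3}$$
Comparing the multiplicative derivative with this classical definition, we observe that the difference $f(x+h) - f(x)$ is replaced by the ratio $f(x+h)/f(x)$, and the division by $h$ is replaced by raising to the reciprocal power $1/h$. Additionally, the multiplicative derivative and the classical derivative are related by (4) and (5):
$$f^{*}(x) = e^{(\ln f)'(x)} = e^{f'(x)/f(x)}, \tag{4}$$
$$f'(x) = f(x)\, \ln(f^{*}(x)). \tag{5}$$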
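As a quick numerical illustration of these definitions, the sketch below approximates the multiplicative derivative in two ways: directly from the limit of the ratio raised to the power 1/h, and through the classical derivative as exp(f'(x)/f(x)). The function names and the step size `h` are assumptions made for this example only.

```python
import math


def multiplicative_derivative(f, x, h=1e-6):
    """Approximate f*(x) = lim (f(x+h)/f(x))**(1/h) for a positive-valued f."""
    return (f(x + h) / f(x)) ** (1.0 / h)


def via_classical(f, x, h=1e-6):
    """Same quantity via f*(x) = exp(f'(x)/f(x)), using a central difference for f'."""
    df = (f(x + h) - f(x - h)) / (2.0 * h)
    return math.exp(df / f(x))


# For f(x) = x**2 the closed form is f*(x) = exp(2/x), so f*(2) = e.
print(multiplicative_derivative(lambda x: x * x, 2.0))
print(via_classical(lambda x: x * x, 2.0))

# For f(x) = exp(3x) the multiplicative derivative is the constant exp(3).
print(multiplicative_derivative(lambda x: math.exp(3.0 * x), 0.5))
```

The second example shows why this calculus suits exponentially varying functions: an exponential has a constant multiplicative derivative, just as a linear function has a constant classical derivative.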

IV. THE PROPOSED ALGORITHM
The proposed algorithm is a modified version of BP for training neural networks (NNs). The algorithm still makes backward passes while updating the adjustable parameters.
The novelty and improvement of the proposed algorithm is the use of the multiplicative form of the derivative in the computations rather than the classical derivative. The sigmoid function is selected as the activation function in the layers of the NN, but the new derivative form is applied only in the last layer, not in the hidden layers. The sigmoid function is defined by (6):
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1}. \tag{6}$$
The natural logarithm of the sigmoid function is calculated as given in (7):
$$\ln(\sigma(x)) = \ln\left( \frac{1}{1 + e^{-x}} \right) = -\ln\left(1 + e^{-x}\right). \tag{7}$$
Then the classical derivative of the natural logarithm of the sigmoid function is evaluated as given in (8) and (9):
$$\frac{d}{dx} \ln(\sigma(x)) = \frac{e^{-x}}{1 + e^{-x}}, \tag{8}$$
$$\frac{d}{dx} \ln(\sigma(x)) = \frac{1}{1 + e^{x}} = 1 - \sigma(x). \tag{9}$$
Finally, the multiplicative derivative of the sigmoid function is formulated by (10):
$$\sigma^{*}(x) = e^{(\ln \sigma)'(x)} = e^{1 - \sigma(x)}. \tag{10}$$
The proposed algorithm is described in Algorithm 2.
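The closed form of the sigmoid's multiplicative derivative, exp(1 - sigmoid(x)), can be written in a few lines, together with one possible reading of how it enters the output layer. The sketch is not Algorithm 2. The function `output_delta` is hypothetical: it assumes that the modification amounts to replacing the classical slope of the sigmoid with its multiplicative derivative in the output-layer error term, while the hidden layers keep the classical derivative. The text above does not spell out this detail.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Classical derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def sigmoid_mult_derivative(x):
    """Multiplicative derivative of the sigmoid: exp(d/dx ln sigma(x)) = exp(1 - sigma(x))."""
    return math.exp(1.0 - sigmoid(x))


def output_delta(y, t, z, multiplicative):
    """Error term at a sigmoid output node with pre-activation z.

    ASSUMPTION: the modified rule is read here as swapping sigma'(z) for
    sigma*(z) in the classical delta; this is an illustrative guess, not
    the author's stated update rule.
    """
    slope = sigmoid_mult_derivative(z) if multiplicative else sigmoid_derivative(z)
    return (y - t) * slope


# Compare the two error terms for an output of 0.5 against a target of 1.
z = 0.0
print(output_delta(sigmoid(z), 1.0, z, multiplicative=False))
print(output_delta(sigmoid(z), 1.0, z, multiplicative=True))
```

Under this reading, the only change to a classical BP implementation is the slope used at the output node; the backward pass through the hidden layers is untouched.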

V. RESULTS AND DISCUSSION
To analyse the performance of the proposed modified backpropagation (BP) algorithm, various cases of linearly inseparable problems are examined through some shallow neural network (NN) models.
Case I: Training the XOR classification task with two input nodes, one output node, and one hidden layer of two nodes. The size of the training data set is four. The XOR, or "exclusive OR", classification problem is a classic and common problem in NN research [17]. The two-input XOR problem is shown in Table I. Minsky and Papert [18] first showed that a single-layer perceptron network cannot solve the XOR classification problem. In this experiment, a shallow NN with two input nodes, one hidden layer with two nodes, and one output node is used. The architecture is shown in Fig. 1. Each node in the hidden layer and the output layer includes a bias term.
In the first phase of the experiment, the network is trained with the standard BP algorithm. The sigmoid function is chosen as the transfer function in each node unit. Next, the network is trained with the proposed algorithm. The weights and biases are initially set to 0.5 for each node, as given in (11). The learning rate is chosen as 0.2.
In the second phase of the experiment, the learning rate is increased to unity (1), while the weights and biases are initialised as in the first phase, at 0.5 for each node.
The corresponding performance comparison is given in Fig. 3. The results of training the XOR classification task demonstrate that, for the given problem specifications and network architecture, the proposed modified BP algorithm learns faster than the classical BP.
Case II: Training a linearly inseparable classification task with three input nodes, one output node, and one hidden layer of four nodes. The size of the training data set is four. In this experiment, the linearly inseparable problem shown in Table II is chosen. To solve the classification problem, the NN is designed with three input nodes, one hidden layer with four nodes, and one output node. The NN architecture is shown in Fig. 4. In this case, neither the hidden-layer nodes nor the output node includes a bias term.
In the first stage of the experiment in Case II, the network is trained with the standard BP algorithm. The sigmoid function is chosen as the transfer function in each node unit. Next, the network is trained with the proposed algorithm. The learning rate is chosen as 0.1 in both stages. To compare the two algorithms on equal terms, the weight matrices for the hidden layer and the output layer are initialised identically for classical BP and the proposed algorithm, as given by (15) and (16). Figure 5 shows the average training errors of the proposed algorithm and the standard BP.
In the second stage of the experiment in Case II, the learning rate is set to one, while the initial weights are the same as in the first stage of the experiment. The performance of the standard BP and the proposed algorithm is shown in Fig. 6. The experimental results in Case II indicate that, for the given linearly inseparable problem, the proposed algorithm outperforms the classical BP in terms of the speed of convergence to the minimum loss.
 Case III: Training a linearly inseparable classification task with three input nodes, one output node, and one hidden layer of four nodes. The size of the training data set is eight.
In Case III, another inseparable problem with the size of eight training pairs is selected as given in Table III. To solve the given problem, the same NN architecture is used as in Case II.
In this experiment, the network performance is analysed for both the standard BP algorithm and the proposed algorithm. The sigmoid function is set as the transfer function in each node unit. The learning rate is chosen as one for both algorithms, and the initial weights are the same as in Case II. The resulting average training errors of the standard BP algorithm and the proposed algorithm are presented in Fig. 7.
The results obtained from the experiments conducted in Case III indicate that the proposed algorithm performs better than the standard BP.
As stated above, one of the important pitfalls of BP is its tendency towards slow convergence. The empirical evidence from the experiments conducted in all cases indicates that the modified algorithm trains faster and thus reduces the risk of slow convergence. Another significant risk in training NNs by BP is the local minima problem [19], [20], and this threat is more common in linearly inseparable situations [7]. The local minima problem usually occurs due to the saturation of nodes in the hidden layers of feedforward NNs. In the case of saturation, a lack of harmony would occur in the weights connecting the hidden layer to the output layer [6], and the network may no longer be trained.
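To make the notion of saturation concrete, the short sketch below tabulates two slopes of the sigmoid at an output node: the classical derivative, which tends to zero for large |z|, and the multiplicative derivative exp(1 - sigmoid(z)), which mathematically stays between 1 and e. This is only an illustration of the two formulas under that closed form; it is not offered as the author's explanation of the experimental results, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classical_slope(z):
    """Classical derivative sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def multiplicative_slope(z):
    """Multiplicative derivative sigma*(z) = exp(1 - sigma(z))."""
    return math.exp(1.0 - sigmoid(z))


for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  classical={classical_slope(z):.6f}  "
          f"multiplicative={multiplicative_slope(z):.6f}")
```

At z = ±10 the classical slope is already below 1e-4, so a gradient step that is scaled by it barely moves the weights, whereas the multiplicative slope remains of order one.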
To assess how well the proposed algorithm alleviates the local minima problem in nonlinearly separable classification tasks, the initial weights and biases are set to one of the potential local minimum points of the standard BP algorithm for the two-input XOR classification problem, and the proposed algorithm is then tested on the same network with the same initial values.
The weights and biases for each node are initially selected as given by (17). The learning rate is chosen as unity (1). Figure 8 shows the performance of the proposed algorithm in the case of a local minimum point of standard BP.
A similar experiment is conducted using the NN architecture for the three-input linearly inseparable classification problem in Case II. When the initial weights given by (21) and (22) are selected to correspond to a local minimum and the learning rate is chosen as unity (1), the corresponding analysis is presented in Fig. 9.
Fig. 8. Performances of the proposed algorithm and standard BP for a specific local minimum problem, two-input XOR classification problem.
Fig. 9. Performances of the proposed algorithm and standard BP for a specific local minimum problem, three-input linearly inseparable classification problem in Case II.
For the classification task given in Case III, Fig. 7 also gives this comparison with respect to the local minima problem. Taken together, the experimental evidence indicates that the proposed algorithm is less prone to the common local minima problems.
It is known that, in training NNs, the initial weights, the biases, and the learning rate directly affect the convergence characteristics of the learning process (speed, training time, and avoidance of local minima).
To obtain a generalised performance evaluation, different initial weights and biases are randomly selected, and the algorithms are compared at different learning rates. Table IV presents the experimental results for the two-input XOR problem in Case I.
Both algorithms are run with three different learning rates (α = 0.1, α = 0.5, and α = 1) and with 100 different random sets of initial weights and biases. The random values are generated in the interval [0, 1]. The table reports the mean square error (MSE) averaged over the 100 experiments after the 1000th, 5000th, and 10000th training cycles. A similar analysis is performed to test how well the proposed algorithm avoids the local minima problem. The two-input XOR problem and the same NN architecture as in Case I are selected, and the learning rate is set to one. 100 different random sets of initial weights and biases are generated in the interval [0, 1], and the standard BP and the proposed algorithm are each tested in 100 experiments.
The results are recorded after 50,000 epochs, and the experiments that converge to a local minimum are labelled as learning failures. The success rate is calculated as the number of experiments that converge to the global minimum. Table V shows the success rates for avoiding the local minima for the two-input XOR problem in Case I.
To compare the proposed algorithm with improved versions of BP, two benchmark problems are selected: the standard two-input XOR problem as in Case I and the modified XOR problem [21]. The training performances of the algorithms are evaluated in terms of their convergence speeds in the simulations. The improved versions of BP used in the comparisons are the BP algorithm with momentum (BP-M), BP with ESP function (BP-ESP) for the output nodes, BP with ESP function for hidden output nodes (BP-ESP-H) [22], BP with gain (BP-G) [23], BP with adaptive momentum (BP-AM) [24], BP with adaptive gain (BP-AG) [25], and the Nguyen-Widrow weight initialisation technique (NG-W) [26]. To be consistent with the methods reported in [21], in the proposed algorithm the weights and biases are initialised to random values in the range (-0.5, 0.5), the learning rate is set to 0.5, and the performance measure is chosen as MSE. 30 independent trials are conducted, and the number of epochs required for convergence is recorded for each trial. The mean number of epochs required to converge is then calculated. The termination condition for convergence is an MSE of 0.001.
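The evaluation protocol described above can be sketched as a small helper that works on recorded MSE curves, one list of per-epoch MSE values per trial. The helper names are hypothetical. Treating "MSE at or below a threshold" as the criterion for a successful trial is an assumption of this sketch: the 0.001 value is the stated termination condition for the epoch counts, but the exact criterion used to decide convergence to the global minimum is not given.

```python
def epochs_to_converge(mse_history, threshold=0.001):
    """Return the 1-based epoch at which the MSE first reaches the threshold, else None."""
    for epoch, mse in enumerate(mse_history, start=1):
        if mse <= threshold:
            return epoch
    return None


def summarise_trials(histories, threshold=0.001):
    """Return (success rate in %, mean epochs to converge over the converged trials)."""
    epochs = [epochs_to_converge(h, threshold) for h in histories]
    converged = [e for e in epochs if e is not None]
    success_rate = 100.0 * len(converged) / len(histories)
    # ASSUMPTION: trials that never reach the threshold count as learning failures
    # and are excluded from the mean.
    mean_epochs = sum(converged) / len(converged) if converged else None
    return success_rate, mean_epochs


# Three made-up MSE curves, used only to show the call pattern (not results from the paper).
histories = [
    [0.30, 0.10, 0.0005],   # reaches the threshold at epoch 3
    [0.30, 0.25, 0.25],     # stalls: counted as a failure
    [0.20, 0.0009, 0.0001], # reaches the threshold at epoch 2
]
print(summarise_trials(histories))
```

Keeping the summary separate from the training loop lets the same helper be applied to curves produced by standard BP and by the modified algorithm.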
First, to train the two-input XOR problem, a 2-2-1 network (two input nodes, one output node, and one hidden layer of two nodes) is used as in Case I. The results of the corresponding comparisons on this problem are given in Table VI. Performance evaluation results for the improved versions of BP are taken from [21]. The performance comparisons of the improved versions of BP and the proposed algorithm are presented in Table VIII.

VI. CONCLUSIONS
In this paper, a modified backpropagation (BP) algorithm with multiplicative calculus is proposed for feedforward neural networks (NNs). The proposed algorithm retains the general characteristics of standard BP, in that it makes backward passes while updating the learnable parameters. Its originality and enhancement lie in the use of the multiplicative form of the derivative in the computations rather than the classical derivative.
The sigmoid function is used as the activation function, and the new multiplicative derivative is applied only in the last layer, not in the hidden layer.
Standard BP is very common in many NN applications, but it has two major drawbacks: convergence can be slow, and the training algorithm can converge to a local minimum. The proposed algorithm introduces a novel solution to overcome the slow convergence rate problem and avoid the local minima problem. Several different tasks are chosen, and several experiments are conducted to measure the performance of the proposed algorithm. The experimental results show that the proposed algorithm with multiplicative calculus improves both the convergence speed and the avoidance of local minima. Simulations carried out in Case I demonstrate that, with a learning rate of α = 1, the mean square error after the 1000th epoch is approximately 80 % lower than that of the standard BP. Additionally, a 96 % success rate has been achieved in avoiding the local minima problem, whereas the success rate of the standard BP is 85 %.

CONFLICTS OF INTEREST
The author declares that he has no conflict of interest.