Estimating Harvestable Solar Energy from Atmospheric Pressure Using Deep Learning

1 Abstract —This article focuses on applying a deep learning approach to predict daily total solar energy for the next day by a neural network. Predicting future solar irradiance is an important topic in the renewable energy generation field to improve the performance and stability of the system. The forecast is used as a support parameter to control the operation duty-cycle, data collection or communication activities at energy-independent energy harvesting embedded devices. The prediction is based on previous hourly-measured atmospheric pressure values. For prediction, a back-propagation algorithm in combination with deep learning methods is used for multilayer network training. The ability of the proposed system to estimate the daily solar energy is compared to the support vector regression model and to the evolutionary-fuzzy prediction scheme presented in previous research studies. It is concluded that the presented neural network approach gave satisfying predictions in early spring, autumn, and winter. In a particular setting, the proposed solution provides better results than a model using the support vector regression method (e.g., the MAPE value of the proposed algorithm is 0.032 less than the MAPE value of support vector regression method). The time and computational complexity for neural network training is considerable, and therefore it was assumed to train the network on an external computer or a cloud, where only the network parameters have been obtained and transferred to the embedded devices.


I. INTRODUCTION
Energy-independent embedded systems are specially designed for deployment sites [1], where it is not possible to connect such devices to a power grid. A common solution for supplying energy to these devices is using batteries to satisfy energy demands. Due to the frequent deployment in inaccessible locations, this solution has several drawbacks. Another solution includes a combination of rechargeable batteries and an energy harvesting subsystem. This approach, compared to a solution without energy harvesting, needs an energy management strategy, where a knowledge of the future energy availability can help to design a more efficient energy management strategy. This approach provides important input parameters represented by a future energy estimation and it allows the better scheduling of data collection tasks and the system can work efficiently in terms of the distribution of computational power [2], [3].
In general, predicting the future solar irradiance is an important topic in the renewable energy generation field. It is used to improve the planning and operation of the photovoltaic systems [4]. The accuracy of the prediction improves the performance and stability of the systems [5]. The solar availability can be predicted using many methods, such as statistical methods, artificial neural networks (NN), support vector machines, and autoregressive moving averages [4]. Besides these methods, other methods can also be used (regression tree, random forest, gradient boosting, and many others) in the context of the estimation of the future solar energy [6]. In general, these principles can be transferred to the energy harvesting embedded system area to estimate the future obtained energy when the estimation is based on an internal atmospheric pressure sensor.
Machine learning (ML) methods, such as deep NN, are suitable for estimating future solar energy because they add complexity to a model without specifying what form the variation should take and allow the extraction of high-level features [7]. To improve the prediction performance, some authors proposed using hybrid models or an ensemble forecast approach [6], [8]. Deep learning enables the computational model (e.g., a multilayer NN) to create and remember a new representation of the input data [9]. Nowadays, models trained through deep learning have succeeded in the fields of speech and image recognition, object detection, time regression rows, etc. [10].
Deep learning provides tools for solving complex problems using multilayer NN. A fusion of NN and deep learning approaches lead to Deep NN, which can be defined as NN with several (at least three or more) hidden layers [9]. The NN ability to solve more complex tasks increases with the increasing number of layers, but there are increasing demands on the complexity and time to train such a NN. Many problems solved by deep NN are based on mapping input data vectors to output vectors. The fulfilment of these tasks is conditioned by a sufficiently complex NN topology with a large task-representing data set consisting of a set of input vectors and their corresponding output vectors. Tasks that cannot be implemented using input-to-output mapping are now largely beyond deep learning [10]. This paper focuses on the application of deep learning and NN approaches, especially on deep feedforward networks, to forecast the next-day solar energy availability from the atmospheric pressure in an energy harvesting embedded system. A back-propagation method is used to train a multilayer NN in combination with deep learning methods, such as optimalization, batch normalization, and initialization of weights. The algorithm implementation was executed in the C programming language. The results of each combination are compared to previous published approaches and an overall evaluation is provided.

II. DEEP FEEDFORWARD NETWORKS
The aim of the deep feedforward NN is to approximate a complex function f  by defining the mapping of the input vector x to the initial vector .
y The mapping has the form of (1), where  represents the set of parameters resulting from the functional approximation [10]   ,.
The deep feedforward NNs are formed by the hidden layers. The input and output layer are defined by the structure of the input and output data. The properties of a NN are determined by a topology, number, and width of the hidden layers. With the increasing number of layers and neurons, the complexity of the network training increases, but the ability of NN to solve more complex tasks also increases [11], [12].
One of the deep learning methods is the gradient descent algorithm, which is used for NN learning. The gradient descent is a method of minimizing the purpose function   J  by parameterizing the model parameters .
n R  The parameterization is realized by updating the parameters in the direction opposite to the gradient of the objective function   J   [13]. In general, there are three basics configurations of the gradient descent -a batch gradient descent (an ordinary gradient descent), a stochastic gradient descent, and a minibatch gradient descent -which differ according to the volume of data from the total dataset that is used to calculate the gradient of the objective function [14]. The most used gradient algorithm for descent optimization is backpropagation (BP). Generally, BP is a type of batch gradient descent, but it can be converted to the other two types. The choice of type depends on the application, time requirements, availability, and data content [10].
The form of the stochastic gradient descent is shown in The parameter is updated for each training sample x d and its corresponding result t d [15].
The stochastic gradient descent performs frequent updates with a high deviation, which causes high fluctuations in the purpose function. However, these fluctuations may allow them to escape from the local minimum, which may be advantageous, especially for the non-convex surfaces of the purpose function. On the other hand, it slows down convergence to the exact minimum, though this can be removed to some extent by optimizing the learning coefficient  (with a very good choice of ,  the same convergent behavior can be achieved as with the batch gradient descent) [14]- [30]. The learning coefficient α determines the length of the step that is in progress when updating to reach the (local) minimum. In other words, we follow the direction of the steepest slope of the surface of the purpose function in steps [30]   ;; ; ; .
The Mini-batch gradient descent is a compromise between the two approaches. This gradient descent method performs an update for each mini-batch of training data; therefore, the entire training data set is not used, but only a certain part of it, see (3) [16], [17]. Unlike the stochastic gradient descent, this method is not as sensitive to changes in hyper-parameters, especially the learning rate used during optimization, which leads to a reduction in fluctuations and thus to a more stable convergence [14].

A. Back-Propagation
Back-propagation (BP) is a commonly used algorithm for training the multilayer feed forward NN [16]. There are various learning parameters, such as the learning rate, momentum or activation function which can improve the BP learning algorithm. These improvements significantly speed-up NN convergence and avoid convergences in the local minimum [18]. These approaches are described further in the subsections "Initialization of weights" and "Optimization methods".
The BP method requires a set of training data consisting of ordered pairs of vectors, where one pair is formed by a vector of input values ˆj x and the other pair is formed by a vector of required outputs ˆj y for input ˆ. j x The set contains N of these training vector pairs [10].
The algorithm of the BP method consists of the following steps. Firstly, forward transmission is performed for the input-output pair of vectors where E is error function and  is set of parameters resulting from the best functional approximation, is stored. This step can be divided into the following sub-steps: 1. Calculate L j  , i.e., calculate the auxiliary variables  for the output layers L, see (4)  For the input layer by (7) , , 4. Perform steps a-c for all input-output vector pairs.
6. Repeat the entire procedure until the total NN error is less than the specified accuracy, or until the completion of a specified number of iterations [18], [19].

B. The Problem of the Vanishing-Exploring Gradient
The problem of the vanishing-exploring gradient especially occurs in recurrent and multi-layered NN. The problem concerns the updating of NN parameters during learning. The principle of the vanishing gradient lies in the decreasing value of the partial derivative of the error function of a particular parameter during the learning process, which leads to a complete cessation of convergence. This causes the parameters in the remote layers to change significantly more slowly than in the layers close to the output layer. The simplest solution is to choose an activation function whose derivative is greater (e.g., ReLU). On the other hand, if the learning coefficient is chosen incorrectly, using the ReLU activation function can cause divergence. This problem can be largely eliminated by applying optimization methods and the appropriate initialization of parameters or pre-training [20], [21].

C. Initialization of Weights
It is necessary to set the NN parameters before starting the learning algorithm with the proper selection of the initialization method. Those parameters should correspond to the chosen activation function (e.g., for sigmoid function it is suitable to initialize weights in the interval (0, 1), for hyperbolic tangent function, the interval (-1, 1)) [22] is more suitable.
There are many weight initialization configurations, such as the Rectified Linear Unit (ReLU) activation function, hyperbolic tangent function, and sigmoid function. For the ReLU activation function, the correct weight initialization is not a major problem. For any nonnegative number, the gradient is equal to the internal neuron potential. For a negative number, the gradient is equal to 0 or the parameter .
 However, for faster convergence, it is advisable to initialize the weights with numbers from the interval defined by (10), where n is the number of inputs in layer k For ReLU, Xavier initialization can be used with a normal distribution in the interval defined in (11) For the sigmoid activation function, the uniformed Xavier initialization method in (13)

D. Optimization Methods
Optimization methods compensate for the problem of an increasing or decreasing gradient (i.e., a situation where either the gradient converges rapidly to zero or, conversely, diverges to infinity, such as the problem of a vanishingexploring gradient) [25].
Adam is an algorithm for optimizing a stochastic object function based on the Jacobian calculation. The algorithm is computationally efficient with no extensive memory requirements and it is invariant to the diagonal scaling of gradients. This approach is based on the calculation of individual learning NN coefficients from the estimation of the first and second moment of the gradient [22]. The algorithm combines the advantages of the AdaGrad method, which is suitable for solving problems with small gradient values [5], and the RMSProp method, which is suitable for the online learning mode (stochastic gradient descent) with non-stationary settings [26]. The advantage of Adam's algorithm is that the magnitudes of the parameter changes are invariant to the gradient scaling. It does not require a stationary object and it is also suitable for cases where the gradient takes on small values [22].
Nesterov Momentum is a modification of the traditional momentum method. The gradient is calculated at the point that would be reached after performing a previous step  times. This point is obtained by summing the original values of the parameter with a multiple of the previous vector V (initialized to zero) with the friction parameter  [25].
RMSProp is one of the adaptive learning rate methods, i.e., methods that optimize the learning coefficient during NN training. It uses "signal-to-noise" normalization with an absolute magnitude of the gradient in combination with exponential averaging [25], [26].

E. Batch Normalization
The training of deep NN is complicated by the fact that the inputs of all neurons in the hidden layers and the output layer are affected by the parameters of the previous layer's neurons. During training, the parameters of neurons, and thus the activation variables and inputs of neurons, are changing. Changes in neuron inputs in the previous layer are transmitted to the current and subsequent layer. This propagation causes instability leading to reduced convergence or even divergence in the deeper layers. Such a problem is described as a change in the distribution of layer inputs. The change in layer distribution is referred to as an internal covariance shift [27], [28].
The aim of the batch normalization is to reduce the internal covariance shift by adding a normalization layer. The normalization layer can theoretically be placed before the activation function application (this is pre-activation batch normalization) or at the output of the activation function (post-activation batch normalization). Practically, only the pre-activation version is usually used [27]. Batch normalization allows the application of a larger learning coefficient without the risk of divergence. It also performs model control and thus eliminates the need to use the Dropout control [29]. The normalization pre-activation layer is added for each neuron according to (14)- (18):

III. DEEP LEARNING EXPERIMENT
The aim of the presented experiment was forecasting a solar energy availability for the following day by using a deep learning algorithm with a multilayer perceptron. A back-propagation algorithm was used to train multilayer NN in combination with deep learning methods, such as optimalization, batch normalization or the initialization of weights. Each of these deep learning methods has several configurations which will be compared. The deep learning experiment algorithm for one configuration of deep learning methods is shown on Fig. 1. First, a test configuration of deep learning methods is set. Then the NN training is performed using the dataset 2012. Subsequently, the trained network is tested using the dataset 2013, which results are the prediction of the available solar energy. Then the evaluation parameters of the resulting dataset are calculated. These parameters are used for comparison to other test configurations of deep learning methods, which are performed.
Outputs of the forecast algorithm were validated by a Mean absolute percentage error (MAPE). All the experimentally chosen combinations of the learning algorithm settings and other forecasting approaches were compared using a Root mean squared error (RMSE).
Two datasets (2012, 2013), which contain hourly measured values of the atmospheric pressure, were used in this experiment. The dataset from 2012 was selected for NN training. The trained NN was tested using the dataset from 2013. The prediction of the solar energy availability for the following day based on the test dataset (2013), the calculation of the evaluation parameters (MAPE, RMSE), and the storage of the forecast output data is intended. The evaluation parameters were compared among all chosen combinations of the learning algorithm settings.

A. Deep Learning Experimental Settings
A deep feedforward network approach was selected for the realization of the experiment. The structure of the network was designed by an experimentally chosen number of layers and various types of activation functions (sigmoid, hyperbolic tangent, softsign, and ReLU (R), identity (I)). BP in three variants of gradient mode (mini-batch gradient descent (M), stochastic gradient descent (S), and batch gradient descent) was selected for the network training. BP in mini-batch mode was used to learn a model using backnormalization (BN). Due to the small number of data samples, the batch size was always chosen for the minibatch gradient descent variant in the range from 0 to 10. The optimization was chosen among three optimization modes (ADAM, RMSProp with Nesterov Momentum (RMSProp) and Nesterov Momentum (NEST), none of them). There were three options to select from the initialization of the weights (Xavier, Sigmoid, and Glorot). The network configuration and training process settings were chosen experimentally.

B. Experimental Data
The experiment input data consisted of hourly measured values of the atmospheric pressure. The dataset from the entire year of 2012 was used for training the deep NN, and the dataset from the entire year of 2013 was used for the algorithm testing. The values of the atmospheric pressure in one day (24 values) and the value of available daily solar energy represented one input vector of the NN. The output vector consisted of an estimate of the value of the available solar energy for the following day. The range of input and output data values was normalized to a range from 0 to 1.
The input vector (the measured pressure values in Pa) and the output vector (the values of the daily solar energy availability in 2 / Jm ) of the training dataset must be normalized. Therefore, a data scaling procedure by mean normalization (19) was implemented , max( ) min( ) where These parameters are represented by:  is the site latitude,  is the transmissivity estimate, and G is the extra-terrestrial solar radiation. This calculation was taken from the work in [30] and ˆA E values were used for designing the output vector of the training dataset for the NN learning.

A. Reference Methods
The ability of the deep learning system to estimate daily solar energy was compared to the evolutionary-fuzzy prediction scheme (FR) and support regression model (SV). The reference results, which stated FR demonstrates better results than SV, were obtained from the work in [30].

B. Evaluation Parameters
After the NN training, the NN was tested using a dataset from the year 2013. The results of the forecast algorithm were validated by calculation of MAPE (21) where n is the number of the predictions, i y is the real value of the pressure, and i x is the predicted value of the pressure.
To compare the accuracy of the forecast with other approaches, the RMSE (22) was calculated x y n    (22) where n is the number of the predictions, i y is the real value of the pressure, and i x is the predicted value of the pressure.

V. RESULTS AND DISCUSSION
The NN was tested with the experimentally chosen settings of the training parameters. The experiment used two datasets which contain hourly measured values of the atmospheric pressure. The first of them was used for network training and the second one was used for the trained network testing. The evaluation parameters were calculated to compare them to each other and to the results of the reference methods for each experiment configuration. Table I shows the MAPE and RMSE values of the individual configuration settings in the deep learning experiment, which performed best in experimental testing. The selected methods of the individual configurations are described in Table II. The N2 configuration (Nesterov, Batch Normalization) has the lowest value of MAPE (0.235) of all the proposed setting combinations. A comparison of MAPE values of the N1 (0.244) and N2 (0.235) configurations, which had the same settings except for the number of layers, shows that it is better to choose a larger number of layers for a more accurate forecast. The A1 (MAPE = 0.249) and A2 (MAPE = 0.240) also had the same configuration apart from the initialization method and their comparison showed that the mode with Sigmoid is better than with Xavier. Once the optimization methods were compared, it was shown that the best results are characterized by NEST (N1 -MAPE = 0.244), followed by ADAM (A1 -MAPE = 0.249, A4 -MAPE = 0.310) and RMSProp (R1 -MAPE = 0.330).
It was also observed that the vanishing-exploring gradient problem occurs in several initial iterations with inappropriately chosen settings. The vanishing-exploring gradient problem more often occurs at settings without an optimization method or, in the case of the NEST optimization method, with an inappropriately chosen learning coefficient, which confirms the described theoretical background. The RMSProp and ADAM optimization methods generally required a smaller value of the learning coefficient then NEST. The learning coefficient value was also dependent on the selected learning mode, i.e., if a stochastic gradient descent or mini-batch gradient descent was selected (for the mini-batch gradient descent, it also depended on the batch size). In general, it could be said that the result of the learning was highly conditioned by the selection of the learning coefficient. The speed of the convergence was also affected by the initial parameter initializations, which were initialized by the Xavier or Sigmoid method. Better results were obtained with the Xavier initialization method.
During the testing of various testing configurations, it was observed that if the total iteration number was increased, the network was re-learned and was not able to respond to sharp fluctuations (typical for the summer period). The forecast of the re-learned network corresponded to a filtering type or actual value averaging. The optimal number of iterations was between 10,000 and 15,000 for the value of MAPE of each configuration except A3 and A4 (the number of iterations was set at 80,000).
Regarding the gradient mode, it was empirically established that the batch gradient was not suitable for network training with the dataset (2012) because of unsatisfactory results. Therefore, no configuration with the batch gradient mode appeared in the top results of Table I and the gradient descent mode was just chosen from among the mini-batch mode and stochastic batch mode. A comparison of the results of A3 (MAPE = 0.240) with A1 (MAPE = 0.249) and A4 (MAPE = 0.310) shows that the stochastic gradient descent is significantly better than the mini-batch.
A comparison of the results from the previous study [4] to the deep learning experiment is shown in Table III. The table shows that the best result was achieved using the evolutionary fuzzy rules model (FR), which exceeds all the predictions obtained. However, the best forecasts from the trained NN surpassed all the above support vector regression models (SV). It could also be said that the forecasts in spring, autumn, and winter were more accurate in the case of the deep learning experiment. Figure 2 shows that the NN presents satisfying predictions in early spring, autumn, and winter because there were no significant fluctuations in daily solar energy availability. In terms of summer months and late spring, the forecast did not correspond to sudden fluctuations, which are typical for these months. In Fig. 2, a prediction of the specific configurations (A2, N2, and R2) is shown. The differences among these individual configurations are not statistically significant, as evidenced by the MAPE values.

VI. CONCLUSIONS
This paper introduces the application of deep learning methods with the aim of training a multilayer NN, which can use a prediction tool to estimate future incoming solar energy. The BP algorithm in combination with deep learning methods, such as optimalization, batch normalization, and the initialization of weights, was used for network training.
In the results, it was observed that the N2 configuration had the lowest value of MAPE (0.235) than the other deep learning settings configurations. This MAPE value was 0.032 less than the lowest value of the support vector regression method (the MAPE value of the SV R_L was 0.267). It could be said that the results of the experiment are comparable, and settings of the proposed solution are better than in models using the support vector regression method because all the MAPE values of the support vector regression are higher than the MAPE values of the proposed algorithm (N2). The results also showed that the forecast of the deep learning model did not perform with as satisfactory results in comparison to the evolutionary fuzzy rules method (the MAPE value of the FR was 0.224), based on that the MAPE value of N2 was 0.011 greater than of FR. Also, it could be clearly concluded that the NN gave satisfying predictions in early spring, autumn, and winter because there were no significant fluctuations in the daily solar energy availability. In terms of the summer months and late spring, the forecast did not correspond to sudden fluctuations, which are typical for these months.
In the article in [31], there were used six neural methods to predict energy consumption -artificial neural networks The time and computational complexity for NN training was considerable, and therefore it was assumed the network was trained on an external computer or a cloud, where only the network parameters have been obtained and transferred to the embedded devices. Using the parameters obtained in this way, it was simple to implement a prediction model of an NN on an embedded system, because in the prediction phase, the network is represented by a look-up table. The memory consumption of the model depended on the size of the network (which determines the total number of parameters representing the NN), the type of architecture and the selected data type.
It is important to emphasize that the outcome of the learning process was greatly affected by the initialization of the parameters that used pseudo-random number generators. Therefore, different NN training results could be achieved with the same settings. A possible improvement could be the use of more datasets measured over several years to train the NN. Another possible aspect could be using the Dropout or L2 control method.