A Signal Strength Fluctuation Prediction Model Based on the Random Forest Algorithm

Abstract: This article describes the effect of the weather on radio wave propagation in a mobile telecommunication network. The research is focused on urban and countryside environments where a correlation between the received signal power level and weather conditions is found using the Random Forest algorithm as a signal level approximator. The results achieved in this paper clearly indicate that it is possible to predict the behaviour of the received power level in relationship to atmospheric phenomena.

communication. Moreover, the magnitude of these degradations varies with time. Therefore, it would be of considerable interest and scientific use to predict path loss and to use the output of the prediction as the input for coverage planning, mobile radio network optimization and radio link adaptation [1].
Cell sizes depend on many factors, such as terrain profile, type of environment (urban, suburban and countryside) and transmission parameters (such as transmitted power) of the base stations. Another factor which affects the cell size (the radio coverage) is the weather, such as rain [2], [3] or sand and dust storms [4]. Measurements taken during the rainy season led to the modification of the Okumura-Hata wave propagation model [5]. The effect of wet ground on radio propagation has been studied in [6]. The factors affecting path loss due to rain and snow have been implemented into the propagation model [7]. Some practical improvements in the existing models for macrocells, microcells and signal prediction in the indoor environment, as well as some new models, have been studied in [8]. The behaviour of electromagnetic waves propagating through densely arboreous environments [9] and through sparsely arboreous areas has been studied in [10]. The effect of the environment and altitude on the UHF band has been studied in [11]. In [12] a radio link was augmented by a radio channel state prediction method. In [13] we presented a K-means method that was used to decide which weather-related parameter affects the propagation of radio waves in a mobile telecommunication network.
This research is focused on proposing a way to approximate the signal strength based on an analysis of how it may be affected by atmospheric phenomena such as humidity and temperature. The secondary contribution is to create such an approximation for both urban and countryside environments and to compare the two.

II. DATA ACQUISITION
The data which are analysed consist of two subsets. The first subset is related to the identifiers associated with GSM technology while the second subset is related to the identifiers associated with weather. The data related to GSM are parameter values that are transferred between the mobile station (MS) and the mobile network through signalling [14]. The parameters are acquired by a GSM modem and stored on a computer at periodic intervals. An application that synchronously stores the current values of GSM parameters and current values related to weather has been developed. The data related to weather were acquired by two professional meteorological stations, one in an urban environment (Poruba district of the city of Ostrava) and one in a countryside environment (the village of Bukovec), both in the Czech Republic. Given the slightly different conditions of the stations in each location, the parameters measured vary slightly. Parameters common to both locations are "date, time, temperature, humidity, dew point, wind direction, wind speed, wind chill, barometric pressure, heat index" and "precipitation per hour". Parameters specific to the Poruba station are "THW index, falling" and the time of "sunrise" and "sunset", while the sole Bukovec-specific parameter is the "pressure development" over 24 hours. The "THW index" uses humidity, temperature and wind to calculate a temperature that incorporates the cooling effects of the wind into our perception of temperature. "Falling" represents the trend of barometric pressure.

III. PREDICTION MODEL
In order to approximate the power level of the Base Transceiver Station (BTS), a prediction model needs to be constructed. In this paper, the Random Forest algorithm, which groups Classification and Regression Trees (CARTs) into a majority voting classifier, was used.

A. Classification and Regression Tree
Originally proposed by Breiman in [15], CART is a tree structure capable of solving both classification and regression problems. Basically, a CART is a binary tree that uses a set of yes/no questions to construct its nodes by splitting an observation into two parts that are as homogeneous as possible and then repeating the process for each resulting part until complete decomposition of the observation is achieved. The classification and regression modes of the CART use different algorithms that govern the splitting.
Common to both modes is the use of an impurity function $i(t)$ to evaluate the homogeneity of the split [16]. Given that the impurity of a parent node is always constant when computing its child nodes, the impurity of the child nodes is computed as a change in impurity of the computed node, $\Delta i(t)$. Let $t_p$ be a node that is the parent to its left node $t_l$ and right node $t_r$, and let $P_l$ and $P_r$ be the probabilities of the respective nodes. The change in impurity can then be computed as

$$\Delta i(t) = i(t_p) - P_l\, i(t_l) - P_r\, i(t_r).$$

Denoting an observation variable by $x_j$ and the best splitting value of the variable by $x_j^R$, where $j = 1, \dots, M$ and $M$ is the number of variables, CART can be viewed as solving the following maximization problem in each of its nodes:

$$\underset{x_j \le x_j^R,\; j = 1, \dots, M}{\arg\max} \left[ i(t_p) - P_l\, i(t_l) - P_r\, i(t_r) \right].$$

The difference between the classification and regression mode of the CART is in the way the impurity function is calculated. In the classification mode, the most commonly used impurity function follows this formula, known as the Gini splitting rule or Gini index:

$$i(t) = \sum_{k \ne l} p(k|t)\, p(l|t),$$

where $k, l = 1, \dots, K$ index the classes, $K$ is the number of classes and $p(k|t)$ is the conditional probability of class $k$ given that the current node is node $t$. Incorporating the Gini rule, CART solves the following maximization problem:

$$\underset{x_j \le x_j^R,\; j = 1, \dots, M}{\arg\max} \left[ -\sum_{k=1}^{K} p^2(k|t_p) + P_l \sum_{k=1}^{K} p^2(k|t_l) + P_r \sum_{k=1}^{K} p^2(k|t_r) \right].$$

As regression trees have no classes, their output is a response vector containing response values for each observation. The Gini rule cannot be applied due to the absence of classes; instead, a squared residual minimization problem is solved. Defined as

$$\underset{x_j \le x_j^R,\; j = 1, \dots, M}{\arg\min} \left[ P_l \operatorname{Var}(Y_l) + P_r \operatorname{Var}(Y_r) \right],$$

where $Y_l$ and $Y_r$ are the response vectors of the corresponding child nodes and $\operatorname{Var}(\cdot)$ is their variance, the problem attempts to minimize the expected sum of variances for the two resulting nodes. Here, $x_j \le x_j^R$, $j = 1, \dots, M$, is the optimal splitting question capable of satisfying the above formula.
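The classification split search described above can be illustrated with a minimal sketch. It assumes the standard CART forms, i.e. the Gini index written as one minus the sum of squared class probabilities and the impurity decrease i(t_p) - P_l i(t_l) - P_r i(t_r); the function names and the toy humidity data are hypothetical and are not taken from the measurements in this paper.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p(k|t)^2 of the labels falling into one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Delta i(t) = i(t_p) - P_l * i(t_l) - P_r * i(t_r)."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return gini(parent) - p_left * gini(left) - p_right * gini(right)


def best_split(values, labels):
    """Scan the questions 'x <= threshold' on one variable; return (threshold, gain)."""
    best_threshold, best_gain = None, 0.0
    # The largest value is skipped because it would leave the right child empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        gain = impurity_decrease(labels, left, right)
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical toy data: relative humidity (%) against a made-up signal class.
humidity = [40, 55, 60, 75, 80, 90]
signal = ["strong", "strong", "strong", "weak", "weak", "weak"]
threshold, gain = best_split(humidity, signal)
```

In this toy case the question "humidity <= 60" separates the two classes completely, so the decrease equals the full parent impurity of 0.5. A complete CART would repeat the scan over every variable and recurse into both child nodes.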
While the Gini rule cannot be applied directly, it can quite easily be adapted for the regression mode. Let objects of class $k$ be assigned the value of 1 and objects of other classes the value of 0. The variance of these values would be

$$p(k|t)\,[1 - p(k|t)].$$

Summation over the number of classes $K$ then yields the impurity function commonly used in regression trees:

$$i(t) = 1 - \sum_{k=1}^{K} p^2(k|t).$$

Once a tree is constructed using these rules appropriately, it can be very large, especially in regression mode. Therefore, pruning is employed to reduce the tree size to the desired value. Once pruning is complete, the tree is ready for use in classification and regression problems.
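The link between variance and the Gini index, and the variance-based split used in regression mode, can be checked numerically with a short sketch. It assumes population variance and a single input variable; the function names and the toy temperature and power-level values are hypothetical, and pruning is not shown.

```python
from collections import Counter
from statistics import pvariance


def gini_index(labels):
    """Gini index written as 1 - sum_k p(k|t)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def summed_indicator_variance(labels):
    """Sum over classes of the variance of the 0/1 indicator 'sample is class k'."""
    return sum(
        pvariance([1 if y == k else 0 for y in labels]) for k in set(labels)
    )


def best_regression_split(values, responses):
    """Return the threshold minimizing P_l * Var(Y_l) + P_r * Var(Y_r)."""
    n = len(values)
    best_threshold, best_score = None, float("inf")
    # The largest value is skipped because it would leave the right child empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, responses) if x <= threshold]
        right = [y for x, y in zip(values, responses) if x > threshold]
        score = (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical toy data: temperature (deg C) against received level (dBm).
temperature = [-5, 0, 5, 20, 25, 30]
rx_level = [-70, -71, -69, -80, -81, -79]
threshold, score = best_regression_split(temperature, rx_level)

# The two impurity expressions agree on an arbitrary label list.
labels = ["a", "a", "b", "c"]
identity_gap = abs(gini_index(labels) - summed_indicator_variance(labels))
```

For the toy data the search settles on the question "temperature <= 5", which leaves two child nodes with a variance of 2/3 each.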

B. Random Forest
Random Forest (RF) could be considered a majority voting classification and regression method as it combines a number of CARTs into a larger structure. For each tree in the forest, a random combination of a predefined number of input parameters is chosen and used to construct it. Testing samples are then evaluated against conditions in each node and propagated throughout the tree. When the sample reaches a leaf node, it is then assigned the class or value to which the samples in that node belong. In the forest, this is performed by all trees, providing a response from each of them. The testing sample is then assigned the class that was suggested by the majority of trees. Commonly, a binary tree with logical conditions is used, as was the case in this paper. Given that an RF consists of CARTs, it is capable of working in two modes, classification and regression, depending on the task it is expected to solve.
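The structure of such a forest can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation used for the experiments: each "tree" is a single-split stump rather than a full CART, each tree is trained on a bootstrap sample (an assumption, as the text above does not state how samples are drawn per tree), and the tree responses are combined by averaging, which is the usual aggregation in regression mode. All names and data values are hypothetical.

```python
import random
from statistics import mean, pvariance


def fit_stump(rows, targets, features):
    """Best single split among the given feature indices, by weighted variance."""
    fallback = mean(targets)
    # (score, feature, threshold, left value, right value); no split by default.
    best = (float("inf"), None, None, fallback, fallback)
    for f in features:
        for threshold in sorted({row[f] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, targets) if row[f] <= threshold]
            right = [y for row, y in zip(rows, targets) if row[f] > threshold]
            score = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(rows)
            if score < best[0]:
                best = (score, f, threshold, mean(left), mean(right))
    return best[1:]


def predict_stump(stump, row):
    """Propagate one sample through the stump and return the leaf value."""
    feature, threshold, left_value, right_value = stump
    if feature is None:
        return left_value
    return left_value if row[feature] <= threshold else right_value


def fit_forest(rows, targets, n_trees, n_features, seed=0):
    """Train n_trees stumps, each on a random subset of n_features parameters."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample (assumed)
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(
            fit_stump([rows[i] for i in picks], [targets[i] for i in picks], features)
        )
    return forest


def predict_forest(forest, row):
    """Regression mode: average the responses of all trees."""
    return mean(predict_stump(stump, row) for stump in forest)


# Hypothetical samples: (temperature, humidity, dew point) -> received level (dBm).
rows = [(5.0, 60.0, -2.0), (10.0, 70.0, 4.5), (15.0, 80.0, 11.5),
        (20.0, 90.0, 18.0), (25.0, 50.0, 14.0), (30.0, 40.0, 15.0)]
targets = [-70.0, -72.0, -75.0, -78.0, -71.0, -69.0]
forest = fit_forest(rows, targets, n_trees=25, n_features=2, seed=1)
estimate = predict_forest(forest, rows[0])
```

For classification, `predict_forest` would instead return the most frequent class among the tree responses, which is the majority vote described above.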

IV. EXPERIMENTS
This section describes the data used to train and evaluate the approximator, the settings of the approximator itself, the experiments performed and the results measured.

A. Data Preparation
Before the data could be used as an approximator's input, optimization procedures had to be performed. The parameter set contains the Receive Level values from one serving station and up to 6 neighbouring Base Transceiver Stations (BTS). These values are continually sorted from the maximum to the minimum value. Unfortunately, since the value of the Receive Level is affected by fast Rayleigh fading, the position of a given cell in the sorted list, whether the serving cell or one of the neighbours, varied. Therefore, it was necessary to select each cell by its ARFCN (Absolute Radio Frequency Channel Number) rather than by its position.
Once selected, a data matrix was assembled from the collected data for each of the selected BTSs in the common row-sample, column-value format. To construct the matrix, dates and times were split into 3 numeric values each, and text parameters (like wind direction) were converted into integer numbers. Overall, the Poruba meteorological station provided 21 input values per sample against 16 input values per sample from Bukovec. For each BTS, a different number of samples was measured; however, every BTS had an abundance of data to perform the experiments (millions of samples).
As mentioned in the previous section, the RF algorithm requires training and therefore a training set. For this purpose, a small portion of the data, 60,000 samples, was separated. This was the lowest number of training samples that provided the maximum accuracy. Increasing the number of training samples had no effect on the accuracy, while decreasing it was detrimental. The training samples were chosen randomly with normal distribution. The rest were then used to evaluate the Random Forest method.
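The conversion of one stored record into a numeric matrix row might look as follows. This is only a sketch under stated assumptions: the date is taken to split into day, month and year and the time into hour, minute and second, and the wind direction is assumed to be reported on a 16-point compass and mapped to its index. The function name, the compass list and the sample values are hypothetical.

```python
from datetime import datetime

# Assumed 16-point compass; the actual labels reported by the stations may differ.
WIND_DIRECTIONS = [
    "N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
    "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW",
]


def encode_sample(timestamp, wind_direction, numeric_values):
    """Turn one record into a flat numeric row of the data matrix."""
    return [
        timestamp.day, timestamp.month, timestamp.year,       # date -> 3 values
        timestamp.hour, timestamp.minute, timestamp.second,   # time -> 3 values
        WIND_DIRECTIONS.index(wind_direction),                # text -> integer
        *numeric_values,                                      # already numeric
    ]


# Hypothetical record: temperature 7.5 deg C and humidity 81 %.
row = encode_sample(datetime(2014, 3, 5, 14, 30, 0), "NE", [7.5, 81.0])
```

Stacking such rows, one per sample, gives the row-sample, column-value matrix described above.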
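Separating a random training set from the remaining evaluation samples can be sketched as below. The sketch draws sample indices uniformly without replacement, which is an assumption made for simplicity and may differ from the sampling procedure used in the experiments; the function name and the small sizes in the example are hypothetical.

```python
import random


def split_train_test(n_samples, n_train, seed=None):
    """Return (train_indices, test_indices) as two disjoint sorted index lists."""
    rng = random.Random(seed)
    train = set(rng.sample(range(n_samples), n_train))  # uniform, no repeats
    test = [i for i in range(n_samples) if i not in train]
    return sorted(train), test


# Small-scale illustration; the experiments separated 60,000 training samples.
train_idx, test_idx = split_train_test(n_samples=1000, n_train=60, seed=42)
```

Passing a different seed for each trial yields a different training set, which is how repeated trials with independently chosen training data could be organized.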

B. Experimental Settings
Overall, 6 different cells were used for the evaluation, two for the Poruba location with CellID 75F4 and 76B3 and four for Bukovec with CellID A863, A864, A865 and AEDD. To measure the RF performance, the Root Mean Squared Error (RMSE) between the expected and resulting output was used. The approximation accuracy was measured for two different approximation scenarios. The first scenario, scenario1, allows RF trees to choose from all possible input parameters. The second scenario, scenario2, limits the number of parameters to a mere three, each representing a water-related weather attribute: temperature, humidity and dew point. This scenario is a natural reaction to the empirical observation suggesting that the parameters which correlate most closely with the BTS power levels are related to water.
The settings used for the experiments were as follows: regression mode in both scenarios, with 6 randomly selected parameters per tree and 1000 trees in the forest for scenario1, and 2 parameters per tree and 3 trees for scenario2. While the number of trees for scenario2 seems low, 2 parameters out of 3 can only be combined in 3 different ways. Thus, more trees would result in redundant identical trees in the forest which could, according to the majority voting principle of the algorithm, influence the performance in favour of one of the combinations. In scenario1, the parameters can be combined in thousands of different ways.
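For reference, the RMSE between expected and approximated values can be computed as in the following sketch. The function name and the three power-level values are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def rmse(expected, predicted):
    """Root Mean Squared Error between two equally long sequences."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical received levels in dBm: the errors are 1, 0 and 2.
error = rmse([-70.0, -72.0, -75.0], [-71.0, -72.0, -73.0])
```

Here the squared errors are 1, 0 and 4, so the result is the square root of 5/3, roughly 1.29 dB.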
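The counts behind this argument can be verified with binomial coefficients. The sketch below assumes that all 21 (Poruba) and 16 (Bukovec) values per sample reported in the data preparation are available as inputs in scenario1; whether every one of them was actually offered to the trees is not stated, so the scenario1 figures are indicative only.

```python
from math import comb

# scenario2: 2 parameters chosen out of 3 water-related attributes.
scenario2_combinations = comb(3, 2)

# scenario1: 6 parameters chosen out of all inputs (assumed 21 and 16 per sample).
scenario1_poruba = comb(21, 6)
scenario1_bukovec = comb(16, 6)
```

This gives 3 combinations for scenario2, against 54264 (Poruba) and 8008 (Bukovec) for scenario1 under the stated assumption, so a forest of 1000 trees covers only part of the possible combinations in scenario1.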

C. Results
Given the random nature of choosing parameters during RF tree construction, the experiments for each BTS were performed in 10 trials and the results were averaged. Each trial was made with a different, randomly chosen training set. Table I shows the resulting RMSEs for scenario1. It can be seen that while RF is a random algorithm that, given the number of possible input parameters in scenario1, cannot cover all possible combinations, it is quite robust in its performance. The results of individual trials vary only insignificantly. Table II presents the results for scenario2, where the average approximation RMSE was, as expected given the much lower number of input parameters, higher.
Aside from the absolute and average RMSE of each BTS, both tables also show the standard deviation of the results. The deviation across trials is minuscule and therefore insignificant, showing the robustness of Random Forest when trained using different samples.
Figure 1 illustrates Random Forest's capability of fitting its output to the expected values. The horizontal axis only shows the index of a sample, while the vertical axis expresses the output parameter, the received power level. The light-coloured polyline illustrates the expected value while the darker polyline is the approximation for 200 randomly chosen test samples. The results showed that it was more difficult to approximate the power level in the urban than in the countryside environment. A sparsely populated countryside environment exhibits different propagation characteristics than a densely populated urban environment. Countryside path loss values are lower than urban path loss values because the countryside areas are composed of open land with small buildings and plain areas. Moreover, in countryside and open areas, the range of slow fading is lower than that in suburban and urban areas. These facts contribute to a more complicated and difficult approximation in the urban environment, as shown above.
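The per-BTS summary over trials can be sketched as follows. The ten RMSE values are made-up placeholders, not results from this paper, and the sample standard deviation is used as an assumption since the type of deviation reported in the tables is not specified.

```python
from statistics import mean, stdev

# Placeholder RMSE values for 10 trials of one BTS (NOT measured results).
trial_rmse = [2.10, 2.12, 2.09, 2.11, 2.10, 2.13, 2.08, 2.11, 2.10, 2.12]

average_rmse = mean(trial_rmse)   # the averaged result per BTS
spread = stdev(trial_rmse)        # sample standard deviation across trials
```

A spread that is small relative to the average, as in this placeholder case, is what the robustness argument above relies on.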

V. CONCLUSIONS
Based on the results in this paper, we are able to predict the behaviour of the received power level affected by atmospheric phenomena. The average RMSE showed that the proposed model can be applied in a method for increasing the efficiency of power consumption for a base station. It is well known that the main source of energy consumption in cellular mobile networks is the base station. Therefore, reducing the energy consumption of BTSs as the main energy consumers is extremely important. The research in this paper suggests that the radiated power level of a BTS can be adjusted while considering the atmospheric conditions, which leads to an improvement in the energy efficiency of the mobile radio network.
In their further work, the research team are to focus on the same issues in different frequency bands related, for example, to Digital Audio/Video Broadcasting technology or the new generation of mobile networks such as Long Term Evolution (LTE). Formalizing and mathematically describing the relationship between the received power level and the mentioned weather parameters, implementing the researched technique into a hardware solution, combining different kinds of input data or developing whole new approaches to solve the problem, possibly based on hybrid paradigms, are other topics of both future and currently ongoing research.

Fig. 1. An illustration of the Random Forest results fitting the expected values.

TABLE I. RESULTING RMSES FOR THE FIRST SCENARIO.

TABLE II. RESULTING RMSES FOR THE SECOND SCENARIO.