Self-Organizing Networks: A Packet Scheduling Approach for Coverage/Capacity Optimization in 4G Networks Using Reinforcement Learning

The next generation mobile networks LTE and LTE-A are all-IP based networks. In such networks, Quality of Service (QoS) becomes increasingly critical as network size and heterogeneity grow. In this paper, a Reinforcement Learning (RL) based framework for QoS enhancement is proposed. The framework achieves coverage/capacity optimization by adjusting the scheduling strategy. The proposed self-optimization algorithm exploits the coverage/capacity compromise in Packet Scheduling (PS) to maximize the capacity of an eNB, subject to the condition that the minimum coverage constraint is not violated. Each eNB has an associated agent that dynamically changes the scheduling parameter value of that eNB. The agent uses the RL technique of Fuzzy Q-Learning (FQL) to learn the optimal scheduling parameter. The learning framework is designed to operate in an environment with varying traffic, user positions, and propagation conditions. A comprehensive analysis of the obtained simulation results is presented, which shows that the proposed approach can significantly improve the network coverage as well as capacity in terms of throughput. DOI: http://dx.doi.org/10.5755/j01.eee.20.9.4786


I. INTRODUCTION
Mobile networks have undergone enormous growth in size and complexity during the last few years. This has resulted in a significant increase in the Operational Expenditure (OPEX) of the Next Generation Mobile Networks (NGMN) [1]. In this context, self-optimization has been included in LTE standardisation as part of the Self Organizing Networks (SON) [2], [3]. The objective of self-optimization is to decrease the network OPEX by introducing automation into the network while, at the same time, enhancing the Quality of Service (QoS) of the network through the optimal setting of Radio Resource Management (RRM) parameters. The network QoS is measured in terms of Key Performance Indicators (KPIs) related to coverage and capacity. SON entities are supposed to operate in an environment with varying traffic, changing propagation conditions, newly introduced services, and evolving management policies of the operator.
Manuscript received January 11, 2014; accepted September 28, 2014.
Academia and industry have worked on self-optimization in Radio Access Networks (RANs) over the last decade [4], [5]. Automated self-optimization enables operators to enhance network performance and profitability while reducing the amount of management operations. The self-optimization algorithms can be implemented in the Operation and Maintenance Centre (OMC) of a network. Self-optimization has been investigated for the land mobile radio cellular communication technologies of GSM and UMTS, as in [6]-[8]. However, despite all these industrial and academic research efforts, self-optimization was not included as a part of the UMTS standard. Research has been extended to self-optimization in heterogeneous applications, mainly for load balancing purposes [9]. With the advent of LTE, the focus of research shifted to the self-optimization of LTE. Recent research on LTE self-optimization has mainly focused on dynamically optimizing Radio Resource Management (RRM) parameters, such as resource and bandwidth allocation [10], Inter-Cell Interference Coordination (ICIC) [11], [12], and load balancing [13], [14].
It has been shown in [9] and [15] that the rules of a Fuzzy Logic Controller (FLC) can be optimized using Q-Learning (QL). These rules are then used for automatic network parameter optimization. An FLC can model a controller as a set of 'IF-THEN' rules. Such rules may be designed using the previous history of the network behaviour. However, when no such previous knowledge is available, Reinforcement Learning (RL) techniques such as QL can be used to derive or optimize the FLC rules. Such a Fuzzy QL (FQL) algorithm has been used in [9] to achieve performance optimization by dynamic load balancing between UMTS and LAN networks. FQL has also been used for the optimization of mobility parameters of both the GSM Edge Radio Access Network (GERAN) [15] and the LTE network [16]. More recently, FQL has been used for coverage/capacity optimization of LTE networks by adjusting the vertical tilt angle of the antenna employed at eNBs [17]-[19].
This paper examines the use of FQL to optimize the Packet Scheduling (PS) so as to achieve maximum eNB capacity while satisfying the minimum coverage constraint. α-fair scheduling is the type of PS used in this work [20]. The α-fair scheduler generalizes several well-known schedulers, including the Proportional Fair (PF), Max Throughput (MTP), and Max-Min Fair (MMF) schedulers.
The α parameter of an eNB can be tuned to achieve a compromise between capacity (higher throughput for its mobile users) and coverage (serving a higher number of users at a time). Hence, for eNBs with degraded performance, the optimization process trades capacity for coverage to achieve the required minimum coverage constraint. For eNBs that satisfy the minimum coverage requirement, coverage is traded for capacity to achieve an additional capacity gain.
The contribution of this paper is the proposal of a novel self-optimization procedure for coverage/capacity optimization based on PS. Furthermore, this approach has the advantage of being scalable with increasing network size, because adjusting the α parameter of an eNB has very little impact on the KPIs of its neighbours [21]. The targeted network architecture for the proposed scheme is LTE and LTE-A. The simulation results have been obtained for the case study of an LTE network and show a significant improvement in the network performance. Since the basic network architecture of LTE and LTE-A is the same, apart from some new features added in LTE-A such as carrier aggregation, enhanced MIMO, and Coordinated Multipoint (CoMP) transmission, the proposed scheme is also valid for LTE-A networks.
The rest of the paper is organized as follows: Section II presents the details of the α-fair scheduler used in our case study. Section III describes the Multi-Agent RL based framework. Section IV details the FQL algorithm along with its various components to solve the Multi-Agent RL problem. Section V describes the simulation environment and provides the obtained simulation results along with a thorough analysis of the results. Section VI concludes the paper.

II. α-FAIR SCHEDULER
LTE uses OFDMA as the radio access technology. Consider an LTE eNB whose frequency bandwidth is subdivided into K Physical Resource Blocks (PRBs), with N users attached to the eNB. P denotes the scheduling policy that selects the user scheduled on a given PRB k at the scheduling instant $t_u$.
The instantaneous throughput of user i at instant $t_u$ on PRB k and the mean throughput of user i are related by (1), which uses a small positive averaging parameter and the Kronecker delta δ.
The user to be scheduled on PRB k at time $t_u$ is given by (2), where the mean throughput is initialized as $r_i(t_0) = 0$ for all i. Here, $d > 0$ is chosen to have a very small value that avoids a singularity at zero.
Hence, the mean throughput of user i during the time interval $[t_0, t_u]$ can be calculated from (3). An eNB utilizes all its scheduling resources, i.e., PRBs, as long as at least one user is connected to it. Therefore, changing the α parameter of an eNB has little effect on the KPIs of the neighbouring eNBs.
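For orientation, the α-fair rule of [20] is commonly written in a form such as the one below. This is a sketch of the standard formulation under assumed notation, not a recovery of the exact equations (1)-(3): the symbols $r_{i,k}(t_u)$ (instantaneous throughput of user i on PRB k), $\bar{r}_i$ (mean throughput), the averaging constant $\epsilon$, and the exponential-averaging recursion are assumptions chosen to agree with the definitions above.

```latex
% Assumed notation; standard alpha-fair form, not necessarily the paper's exact equations.
i^{*}(k, t_u) = \arg\max_{1 \le i \le N}
    \frac{r_{i,k}(t_u)}{\left(d + \bar{r}_i(t_{u-1})\right)^{\alpha}},
\qquad
\bar{r}_i(t_u) = (1-\epsilon)\,\bar{r}_i(t_{u-1})
    + \epsilon \sum_{k=1}^{K} r_{i,k}(t_u)\,\delta_{i,\,i^{*}(k,t_u)}
```

In this form, α = 0 makes the denominator equal to 1, so the user with the highest instantaneous throughput is selected (MTP), while α = 1 gives the usual PF metric.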
Equations (2) and (3) show that for α = 0, the α-fair scheduler acts as the MTP scheduler. As α increases from 0 to 1, the α-fair scheduler changes from the MTP to the PF scheduler. Furthermore, as α increases from 1 towards ∞, the PF scheduler evolves into the MMF scheduler. The capacity of the eNB, given in terms of the user throughputs $r_i$, changes from a maximum value to a minimum value as α goes from 0 to ∞. At the same time, the coverage, given as the number of users served, changes from a minimum to a maximum value as the scheduler tries to achieve greater fairness.
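The effect of α on the scheduling decision can be illustrated with a minimal sketch. It assumes the common α-fair metric (instantaneous throughput divided by the mean throughput raised to the power α); the function name `alpha_fair_pick`, the default value of `d`, and the example rates are hypothetical and are not taken from the simulator.

```python
def alpha_fair_pick(inst_rates, mean_rates, alpha, d=1e-6):
    """Return the index of the user scheduled on one PRB.

    inst_rates: instantaneous throughput of each user on this PRB
    mean_rates: mean throughput of each user so far
    d: small positive constant avoiding division by zero (assumed value)
    """
    metrics = [r / (d + m) ** alpha for r, m in zip(inst_rates, mean_rates)]
    return max(range(len(metrics)), key=metrics.__getitem__)


# Hypothetical rates (arbitrary units) for three users on one PRB.
inst = [10.0, 4.0, 1.0]
mean = [8.0, 2.0, 0.8]

pick_mtp = alpha_fair_pick(inst, mean, alpha=0.0)   # history ignored: best channel wins
pick_pf = alpha_fair_pick(inst, mean, alpha=1.0)    # proportional fair metric
pick_mmf = alpha_fair_pick(inst, mean, alpha=50.0)  # large alpha favours the worst-served user
```

With these numbers the three settings select three different users, which mirrors the progression from MTP through PF towards MMF described above.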

III. QL FOR SELF-OPTIMIZATION IN LTE
QL models the LTE network as a Multi-Agent RL [23] system in which an agent is associated with each eNB. The agents interact in real time with the environment by sensing its state and taking an appropriate action to maximize the reward. Each agent also exploits the knowledge gained from the experience of its past actions. The learning process is characterised as a Markov Decision Process (MDP) [12]. This is because, owing to phenomena such as interference, mobility, and changes in UE traffic distribution and propagation conditions, the inherent dynamics of the mobile network follow a transitional model.
QL is particularly useful for optimization problems where the system model is not available as a closed-form expression. In such a case, the learning problem is solved incrementally using the Temporal Difference (TD) method [23]. In QL, an agent selects those actions which maximize the long term received reward, given by (4), where r denotes the instantaneous reward resulting from an action and γ represents the discount factor. If γ is close to 0, the agent/controller gives more importance to the maximization of immediate rewards, while for γ close to 1 the future rewards are almost as important as the immediate ones. The γ value is set to 0.95 in the present work [12]. Consider an agent that senses the initial environment state s and takes an action b ∈ B while following a fixed policy π.
Here, (s, b) denotes the state-action pair. QL continuously updates its estimate of the value of each state-action pair to achieve the objective in (4), as expressed in (5).
Equation (5) is solved iteratively as in (6) [23], where κ is the learning rate, with a value between 0 and 1.
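The iterative solution can be sketched with the textbook tabular Q-learning update. This is a minimal illustration under stated assumptions, not the exact form of (6): the function `q_update`, the state names, and the action labels are hypothetical, while γ = 0.95 and κ = 0.1 are the values quoted in this paper.

```python
from collections import defaultdict

GAMMA = 0.95  # discount factor used in this work
KAPPA = 0.1   # learning rate quoted in the simulation setup


def q_update(q, state, action, reward, next_state, actions,
             gamma=GAMMA, kappa=KAPPA):
    """Apply one temporal-difference update in place and return the TD error."""
    best_next = max(q[(next_state, b)] for b in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += kappa * td_error
    return td_error


q = defaultdict(float)                      # all Q-values start at zero
actions = ("decrease", "keep", "increase")  # hypothetical action labels
td = q_update(q, "s0", "keep", reward=1.0, next_state="s1", actions=actions)
```

Starting from an all-zero table, the first update moves the visited entry a fraction κ of the way towards the observed reward.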
IV. FUZZY Q-LEARNING
The QL algorithm solves optimization problems where the system state space is discrete. However, in the case of our LTE network optimization problem, the KPIs and RRM parameters are continuous. Hence, the system states are also continuous, leading to enormous complexity. The problem is solved by using fuzzy logic to discretize the state and action spaces. A Fuzzy Inference System (FIS) [24] is shown in Fig. 1. The state vector s is the input to the FIS. The fuzzifier is the first element of the FIS. It determines the degree to which each continuous (crisp) element of the state vector s belongs to each of the fuzzy sets, using membership functions. This procedure is known as fuzzification. The degree of membership information is then used by the Fuzzy Logic Controller (FLC) [12], [25] to calculate the output action for each of the triggered rules. The process of defuzzification maps these actions into a crisp (continuous) value. The fuzzy rules are optimized using QL to form a Fuzzy QL (FQL) optimization process.
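The fuzzification step can be sketched as follows. Triangular membership functions and the breakpoints on a normalized input range [0, 1] are assumptions made for illustration only; the shape and placement of the membership functions used in the FIS are not specified here.

```python
def triangular(x, left, peak, right):
    """Triangular membership function with support (left, right) and apex at peak."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify(x):
    """Map a crisp value in [0, 1] to its degree of membership in each fuzzy set."""
    return {
        "LOW": triangular(x, -0.5, 0.0, 0.5),     # hypothetical breakpoints
        "MEDIUM": triangular(x, 0.0, 0.5, 1.0),
        "HIGH": triangular(x, 0.5, 1.0, 1.5),
    }


degrees = fuzzify(0.25)
```

With this placement, neighbouring sets overlap so that the membership degrees of any input in [0, 1] sum to one, which keeps the defuzzified output a weighted average.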

V. COMPONENTS OF FQL RL SYSTEM
The main components of the FQL based RL system proposed in this paper are given below.

A. State
The proposed state vector of eNB c, which is the input to the FQL controller, is defined as $s_c = [\alpha_c, BCR_c]$, where $\alpha_c$ is the value of the α parameter of eNB c and $BCR_c$ denotes the Block Call Rate (BCR) of eNB c.

B. Policy
The action of each eNB is to change its α according to the policy π. The policy π: s → b maps the state s of an eNB to the action b ∈ B, where B is the set of all possible actions (α values for the eNB).

C. Instantaneous Reward
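The reward in the proposed FQL system is the instantaneous average throughput per user, $r_t$. Let M denote the total number of mobiles in active communication with the network at any given instant t. Then $r_t$ is given as $r_t = \frac{1}{M}\sum_{m=1}^{M} th_m^t$, where $th_m^t$ denotes the instantaneous throughput of mobile m. The mean BCR of the network, denoted as $BCR_{network}$, is constrained to be less than the threshold value $BCR_{th}$.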
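A minimal sketch of the reward computation and the coverage check is given below. The function names are hypothetical, and the default threshold of 0.10 is an assumption inferred from the 90 % access probability threshold used in the results (access probability being 1 minus the mean BCR).

```python
def instantaneous_reward(throughputs_kbps):
    """Average instantaneous throughput per active mobile (0.0 if none are active)."""
    if not throughputs_kbps:
        return 0.0
    return sum(throughputs_kbps) / len(throughputs_kbps)


def coverage_constraint_met(network_bcr, bcr_threshold=0.10):
    """True while the mean network BCR stays below the threshold (assumed value)."""
    return network_bcr < bcr_threshold


reward = instantaneous_reward([500.0, 1500.0, 1000.0])  # hypothetical throughputs
constraint_ok = coverage_constraint_met(0.04)
```

How the constraint enters the learning loop (for example as a penalty or as a restriction on admissible actions) is a design choice that this sketch leaves open.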

D. FQL Algorithm Description
This section presents the FQL algorithm [24]. Let the state vector be $s = [s_1, \ldots, s_j, \ldots, s_n]$, where $s_j$ is the j-th element of the state vector before fuzzification. After fuzzification, the membership function T(s) quantifies the degree of membership of an input value $s_j$ in a specific fuzzy set corresponding to a fuzzy label. The fuzzy label of $s_j$, denoted as $F_j$, can be 'LOW', 'MEDIUM' or 'HIGH'. If R denotes the set of all rules of an FLC, then a rule r ∈ R has the 'IF-THEN' form relating a fuzzy state to an action, where $F_r = [F_1^r, \ldots, F_j^r, \ldots, F_n^r]$ is the modal vector corresponding to rule r and represents a fuzzy state, while $a_r$ is the fuzzy label for the action corresponding to $F_r$.
The q-value $q(F_r, a_r)$, corresponding to the fuzzy state $F_r$ and the action $a_r$, is initialized to zero. The degree of truth $T_r$ of each rule r ∈ R is computed from the membership values of the state elements, where $m_{F_j^r}$ is the membership function of $s_j$ for the label $F_j^r$.
An exploration/exploitation policy (EEP) dictates the action chosen for each of the activated rules. The EEP uses the ε-greedy method for choosing the actions, where L denotes the set of indices of the possible actions for a given triggered rule r. The parameter ε can be assigned a value in the interval [0, 1] to determine the exploration/exploitation compromise. The inferred action, after defuzzification, for a given input state vector s is obtained from the actions of the triggered rules in R, and the associated quality of the rules is calculated from their q-values. As a result of the applied action, the eNB transits to a new state $s_{t+1}$, for which the value function is evaluated. Updating the q-values requires that the difference ΔQ between the quality values of the old and the new state first be calculated.
The q-values can now be updated using the normal gradient descent method, where κ is the learning rate.
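The steps of this subsection can be put together in a short sketch. It follows the commonly used FQL formulation (degree of truth as a product of memberships, truth-weighted action and quality, TD-style update of the q-values), which is assumed here to match the equations of this section. The candidate α values in `ALPHAS`, the membership degrees, and all function names are hypothetical.

```python
import random

GAMMA = 0.95              # discount factor quoted in the paper
KAPPA = 0.1               # learning rate quoted in the paper
ALPHAS = (0.5, 1.0, 2.0)  # hypothetical candidate alpha values per rule


def truth_degrees(memberships, rules):
    """Degree of truth of each rule: product of the memberships of its labels."""
    degrees = []
    for labels in rules:
        degree = 1.0
        for j, label in enumerate(labels):
            degree *= memberships[j][label]
        degrees.append(degree)
    return degrees


def choose_actions(q, epsilon, rng):
    """Epsilon-greedy choice of one action index per rule."""
    chosen = []
    for row in q:
        if rng.random() < epsilon:
            chosen.append(rng.randrange(len(row)))                    # explore
        else:
            chosen.append(max(range(len(row)), key=row.__getitem__))  # exploit
    return chosen


def inferred_action(truth, chosen, actions=ALPHAS):
    """Crisp action: truth-weighted sum of the actions chosen by the rules."""
    return sum(t * actions[a] for t, a in zip(truth, chosen))


def fql_update(q, truth, chosen, reward, next_truth, gamma=GAMMA, kappa=KAPPA):
    """Update the q-values of the chosen actions in place; return the TD error."""
    quality = sum(t * q[r][a] for r, (t, a) in enumerate(zip(truth, chosen)))
    next_value = sum(t * max(q[r]) for r, t in enumerate(next_truth))
    delta = reward + gamma * next_value - quality
    for r, (t, a) in enumerate(zip(truth, chosen)):
        q[r][a] += kappa * delta * t
    return delta


# Two state elements with two labels each (hypothetical membership degrees).
memberships = [{"LOW": 0.5, "HIGH": 0.5}, {"LOW": 0.5, "HIGH": 0.5}]
rules = [("LOW", "LOW"), ("LOW", "HIGH"), ("HIGH", "LOW"), ("HIGH", "HIGH")]
q = [[0.0] * len(ALPHAS) for _ in rules]  # q-values initialized to zero

truth = truth_degrees(memberships, rules)
chosen = choose_actions(q, epsilon=0.0, rng=random.Random(0))  # purely greedy
alpha = inferred_action(truth, chosen)
delta = fql_update(q, truth, chosen, reward=2.0, next_truth=truth)
```

The weighted sums assume that the degrees of truth of all rules add up to one, which holds when the membership functions of each input form a partition of unity. Each rule's q-value moves in proportion to its degree of truth, so rules that contributed more to the applied action receive more of the credit.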

E. Simulation Scenario
An LTE network consisting of 12 eNBs is simulated using a semi-dynamic simulator. The details of the simulator are given in [13]. The traffic model is downlink streaming supporting H.264 with variable bitrates from 64 Kbits/sec to 50 Mbits/sec. The detailed parameters of the simulated dense urban scenario are given in Table I.
Monte Carlo simulations are performed by taking snapshots of the network evolution with a resolution of one-second time steps. At each time step, Call Admission Control (CAC) is performed for new users, mobile positions are updated, and Handover (HO) events are processed. Furthermore, the mobiles that are dropped or that complete their streaming session leave the network. The CAC procedure is as follows. When a new mobile user arrives, its bitrate is calculated from (3) using (2), along with the bitrates of the already scheduled users. Equation (2) uses the quality tables, which depend on the SINR of user i at the scheduling instant. If the bitrate of the new mobile user is above 64 Kbits/sec, it is admitted to the network; otherwise, it is blocked. The streaming session of a mobile is terminated prematurely (dropped) if its bitrate falls below the threshold value of 64 Kbits/sec. The mean Average Bitrate (ABR) of the mobiles in an eNB is used as the KPI of the eNB's capacity, while the mean BCR is used as the KPI of its coverage.
A lower value of α for an eNB signifies that lower SINR users are assigned fewer resources (PRBs). Hence, the CAC procedure may reject a lower SINR user whose bitrate is less than 64 Kbits/sec. This results in an increase of the mean BCR of the eNB (i.e., bad coverage). At the same time, the mean ABR also increases (i.e., good capacity), as higher SINR users are assigned more resources. On the contrary, a higher α value results in more resources being assigned to lower SINR users to achieve fairness among all users. Hence, the mean BCR decreases as more mobile users are admitted to the eNB (i.e., good coverage), while the mean ABR decreases (i.e., bad capacity).
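The admission and dropping rules, and the two KPIs derived from them, can be sketched as follows. The sketch assumes that the BCR of a snapshot is the fraction of arriving users that are blocked; the function names and the example bitrates are hypothetical.

```python
MIN_BITRATE_KBPS = 64.0  # admission and dropping threshold from the text


def call_admission(candidate_bitrates_kbps, threshold=MIN_BITRATE_KBPS):
    """Split arriving users into admitted bitrates and a count of blocked users."""
    admitted = [b for b in candidate_bitrates_kbps if b > threshold]
    blocked = len(candidate_bitrates_kbps) - len(admitted)
    return admitted, blocked


def drop_starved(active_bitrates_kbps, threshold=MIN_BITRATE_KBPS):
    """Keep only the sessions whose bitrate has not fallen below the threshold."""
    return [b for b in active_bitrates_kbps if b >= threshold]


arrivals = [500.0, 64.0, 30.0, 2000.0]    # hypothetical bitrates in Kbits/sec
admitted, blocked = call_admission(arrivals)
bcr = blocked / len(arrivals)             # block call rate of this snapshot
access_probability = 1.0 - bcr            # coverage indicator
mean_abr = sum(admitted) / len(admitted)  # capacity indicator
```

Note that a new user at exactly 64 Kbits/sec is blocked, because admission requires a bitrate above the threshold, whereas an ongoing session is dropped only when its bitrate falls below it.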
The simulator operates in two modes, static and dynamic. In static mode, there is no self-optimization. The simulator runs for 5000 time steps with the default α value set to 1 for all eNBs. The KPIs are calculated by averaging over the time steps from 500 to 5000; the initial 499 seconds are not considered, as the network is initially in a transient state. In the dynamic or self-optimization mode, the FQL algorithm adapts the α of an eNB with a periodicity of 50 seconds. The learning rate is set to κ = 0.1, as in [12]. The simulations are performed over a time period of 150000 seconds.
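The KPI averaging window and the number of learning updates implied by these settings can be checked with a short sketch. The helper `steady_state_mean` and the synthetic per-step samples are hypothetical; only the step counts and periods come from the text.

```python
FIRST_KEPT_STEP = 500    # KPIs are averaged from time step 500 onwards
STATIC_STEPS = 5000      # length of a static-mode run
FQL_PERIOD_S = 50        # alpha is adapted every 50 seconds
DYNAMIC_RUN_S = 150000   # length of a dynamic-mode run in seconds


def steady_state_mean(samples, first_step=FIRST_KEPT_STEP):
    """Average per-step KPI samples, discarding the transient before first_step.

    samples[0] is assumed to hold the KPI of time step 1.
    """
    kept = samples[first_step - 1:]
    return sum(kept) / len(kept)


# Synthetic KPI trace in which the sample value equals the time step index.
trace = list(range(1, STATIC_STEPS + 1))
kept_steps = len(trace[FIRST_KEPT_STEP - 1:])
mean_kpi = steady_state_mean(trace)
fql_updates_per_enb = DYNAMIC_RUN_S // FQL_PERIOD_S
```

With a 50-second adaptation period, each agent performs 3000 updates of α over the 150000-second run.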

F. Simulation Results
The results obtained by the α adaptation using the FQL approach have been compared with the reference system, where α is fixed to the value of 1. Here, the global mean ABR of the mobiles in the network is an indicator of network capacity, while the global mean Access Probability (AP), which is (1 - mean BCR), is an indicator of network coverage.
Figure 2 compares the global mean ABR of the mobiles of the two systems. The application of self-optimization results in a significant improvement in performance compared with the case with no self-optimization. A maximum improvement of up to 10 % can be observed for the traffic value of 4 arrivals/sec. On the other hand, Fig. 3 shows a slight degradation in the global mean AP for the self-optimization case, since coverage has been traded for capacity (smaller α parameter values for eNBs). However, the degraded value of the global mean AP does not fall below the threshold of 90 % up to the traffic value of 5 arrivals/sec. Beyond this traffic value, the mean AP falls below the 90 % threshold, so it is no longer possible to trade coverage for capacity. It can thus be established that the proposed optimization technique achieves an improvement of up to 10 % in capacity, i.e., mean ABR, for traffic values from 1 up to 5 arrivals/sec.
This analysis is further elaborated by the CDF plots of the ABR of individual mobiles for the traffic values of 1, 3 and 5 arrivals/sec in Fig. 4, Fig. 5, and Fig. 6, respectively. For the convenience of the reader, the ABR values at which the compared curves cross over have been marked as points A, B, and C in Fig. 4, Fig. 5, and Fig. 6, respectively. In Fig. 4, it is evident that for $F(x) = 0.2$ the ABR of the optimized curve is 320 Kbps less than that of the non-optimized curve. On the other hand, for $F(x) = 0.6$, the ABR of the optimized curve is 780 Kbps more than that of the non-optimized curve. This difference is due to the fact that, for smaller values of α, the α-fair scheduler assigns more resources to high SINR users at the cost of low SINR users in order to maximize the throughput. Hence, high SINR users obtain a higher bitrate than low SINR users. For the traffic value of 3 arrivals/sec, the network load increases and the global mean AP decreases. However, the global mean AP is still above the threshold of 90 %. As a result, the FQL controller decreases the value of α for the desired eNBs to increase throughput as long as the global mean AP threshold is not violated. The low SINR users are not penalized as much as in the case of 1 arrival/sec. Hence, for $F(x) = 0.2$, the ABR of the optimized curve is 223 Kbps less than that of the non-optimized curve. However, for $F(x) = 0.8$, the ABR of the optimized curve is 1080 Kbps more than that of the non-optimized curve. For the traffic value of 5 arrivals/sec, only a marginal improvement in ABR can be observed, as the α-fair scheduler tends to be even fairer so that the mean AP does not fall below 90 %; consequently, the low SINR users are not penalized.

VI. CONCLUSIONS
In this paper, we have tackled the problem of coverage/capacity optimization in Self Organizing Networks. Resource sharing between the mobile users has been optimized to maximize network throughput, provided that the minimum coverage constraint is not violated. FQL is the optimization technique used to achieve this objective. FQL is a model-free optimization technique, well suited for wireless networks with sporadic changes in mobile positions, propagation conditions, etc. In the performed case study, the improvement in mean ABR is on the order of 10 %, while the network access probability does not fall below the threshold of 90 %. The case study illustrates the potential benefit of the proposed approach in real operating networks.
In the reward definition, $th_m^t$ denotes the instantaneous throughput of the mobile m, and the mean BCR of the network is denoted as $BCR_{network}$.

Fig. 2. Mean Average Bit Rate as a function of the traffic intensity for auto-tuned α parameter compared with fixed α = 1.

Fig. 3. Mean Access Probability as a function of the traffic intensity for auto-tuned α parameter compared with fixed α = 1.

Fig. 4. CDF of the Average Bit Rate for traffic arrival rate of 1 arrival/sec.

TABLE I. SYSTEM LEVEL SIMULATION PARAMETERS.