Load Balancing for Ultra-Dense Networks: A Deep Reinforcement Learning Based Approach

  • 2019-07-05 11:37:24
  • Yue Xu, Wenjun Xu, Zhi Wang, Jiaru Lin, Shuguang Cui
  • 0

Abstract

In this paper, we propose a deep reinforcement learning (DRL) based mobilityload balancing (MLB) algorithm along with a two-layer architecture to solve thelarge-scale load balancing problem for ultra-dense networks (UDNs). Ourcontribution is three-fold. First, this work proposes a two-layer architectureto solve the large-scale load balancing problem in a self-organized manner. Theproposed architecture can alleviate the global traffic variations bydynamically grouping small cells into self-organized clusters according totheir historical loads, and further adapt to local traffic variations throughintra-cluster load balancing afterwards. Second, for the intra-cluster loadbalancing, this paper proposes an off-policy DRL-based MLB algorithm toautonomously learn the optimal MLB policy under an asynchronous parallellearning framework, without any prior knowledge assumed over the underlying UDNenvironments. Moreover, the algorithm enables joint exploration with multiplebehavior policies, such that the traditional MLB methods can be used to guidethe learning process thereby improving the learning efficiency and stability.Third, this work proposes an offline-evaluation based safeguard mechanism toensure that the online system can always operate with the optimal andwell-trained MLB policy, which not only stabilizes the online performance butalso enables the exploration beyond current policies to make full use ofmachine learning in a safe way. Empirical results verify that the proposedframework outperforms the existing MLB methods in general UDN environmentsfeatured with irregular network topologies, coupled interferences, and randomuser movements, in terms of the load balancing performance.

 

Quick Read (beta)

Load Balancing for Ultra-Dense Networks: A Deep Reinforcement Learning Based Approach

Yue Xu*, Wenjun Xu, Zhi Wang, Jiaru Lin, Shuguang Cui*
Key Lab of Universal Wireless Communications, Ministry of Education
Beijing University of Posts and Telecommunications
*SRIBD and The Chinese University of Hong Kong, Shenzhen
Department of Electrical and Computer Engineering
University of California, Davis
Abstract

In this paper, we propose a deep reinforcement learning (DRL) based mobility load balancing (MLB) algorithm along with a two-layer architecture to solve the large-scale load balancing problem for ultra-dense networks (UDNs). Our contribution is three-fold. First, this work proposes a two-layer architecture to solve the large-scale load balancing problem in a self-organized manner. The proposed architecture can alleviate the global traffic variations by dynamically grouping small cells into self-organized clusters according to their historical loads, and further adapt to local traffic variations through intra-cluster load balancing afterwards. Second, for the intra-cluster load balancing, this paper proposes an off-policy DRL-based MLB algorithm to autonomously learn the optimal MLB policy under an asynchronous parallel learning framework, without any prior knowledge assumed over the underlying UDN environments. Moreover, the algorithm enables joint exploration with multiple behavior policies, such that the traditional MLB methods can be used to guide the learning process thereby improving the learning efficiency and stability. Third, this work proposes an offline-evaluation based safeguard mechanism to ensure that the online system can always operate with the optimal and well-trained MLB policy, which not only stabilizes the online performance but also enables the exploration beyond current policies to make full use of machine learning in a safe way. Empirical results verify that the proposed framework outperforms the existing MLB methods in general UDN environments featured with irregular network topologies, coupled interferences, and random user movements, in terms of the load balancing performance.

Deep reinforcement learning, ultra-dense networks, self-organizing networks, load balancing.

I Introduction

The fifth generation (5G) system is anticipated to meet the explosive growth of mobile data traffic, where the ultra-dense network (UDN) has been widely recognized as one of the most promising solutions [1, 2]. In UDNs, the small cell base stations (SBSs) are densely deployed to reduce the distance between the mobile users and the access points, such that the link quality could be substantially improved to increase the network capacity. However, such densification exacerbates the issue of traffic fluctuation compared with the system with large cells, which will greatly affect the network performances, e.g., the quality of service (QoS) experienced by users [3, 4]. Therefore, load balancing, which aims at alleviating traffic fluctuation, is becoming increasingly important in UDNs. On the other hand, the ultra-dense deployment of small cells brings numerous challenges in large-scale network control, communications and optimizations, where configuring the network for load balancing in UDNs is different from that in traditional cellular networks [3, 4]. For example, the traditional load balancing in cellular networks through power control [5] or changing radiation patterns [6] may change the association of all the users in the coverage area, which could downgrade the performance due to the severely coupled interference in a UDN with dense and irregular small cell topology. Thus, exploring the appropriate manner for load balancing in 5G UDNs to accommodate its irregular network topologies, complicated interference relationships, and diversified user mobility patterns still remains a challenging research problem.

Recently, self-organizing networks (SONs), which promote autonomic learning and self-management, have already been recognized as an attractive paradigm to realize adaptive management for 5G UDNs in a scalable manner [7, 8]. Therefore, it is promising to incorporate self-organizing functionalities into 5G UDNs to enable adaptive and scalable load balancing. The traditional load balancing technique with self-organizing characteristics is referred to as mobility load balancing (MLB) [9, 10, 11, 12], which usually optimizes the load distribution via tuning a parameter named cell individual offset (CIO) to control the user handovers. Current literature has proposed various MLB methods. For example, the rule-based methods map system states into pre-defined load balancing rules. Raymond et al. [10] proposed to change the CIOs via the rule with a fixed step-size based on the load difference between adjacent cells; Yang et al. [11] later proposed to use the rule with an adaptive step-size to reach load equilibrium more efficiently. However, for rule-based controllers, the predefined rules are expected to cover all possible circumstances, which requires a perfect understanding on both the system and the rules. Game-theoretic methods, on the other hand, model the MLB process as an ongoing game among cells. For example, Sheng et al. [9] modeled the MLB problem as a Cournot game; Park et al. [12] optimized the user assignment and the target cell selection based on the Kuramoto synchronization and matching theory, respectively. However, the equilibrium and optimality of game-theoretic models are usually built upon specific assumptions over the wireless gaming environment.

On the other hand, the emerging paradigm of machine learning and big data is reshaping the future of wireless networks [13, 14, 15, 2]. Reinforcement learning (RL), which is one class of machine learning methods that can adapt to unknown environments by learning from the environmental feedbacks, has already been applied in load balancing [16, 17]. However, the existing RL-based load balancing methods need to first quantize the system states and the control actions to construct a tabular environment [16, 17], which suffers from the curse of dimensionality, i.e., the numbers of actions and states increase exponentially with the degrees of freedom, thus largely hindering their applications to the complex 5G UDN environments. Noticeably, the recent combination of deep learning and RL, coined as DRL, has greatly improved the generalization capability of RL to achieve the state-of-the-art performances in many fields, such as competitive gamings [18, 19], biological data analysis [20], mobility robustness optimization [21], etc. Moreover, our previous study indicates that the DRL can also achieve state-of-art load balancing performance under a small-scale SON [22]. These successes envision a bright future to exploit DRL in realizing adaptive and autonomous large-scale load balancing under complex UDN environments, e.g., coupled interferences and irregular network topologies. However, this paradigm still remains to be explored.

I-A Contributions

In this paper, we investigate the large-scale MLB problem in UDNs with emphasis on three important features, i.e., generalization, adaptation, and scalability. The main contributions are summarized as follows.

  • Architecture: This work proposes a two-layer MLB architecture to handle the large-scale load balancing problem for UDNs in a self-organized manner. The top layer aims at dynamically grouping the underlying SBSs into self-organized clusters with adaptation to global traffic variations by using a tailored k-means algorithm, which picks the top overloaded SBSs as the initial cluster centroids and then groups SBSs based on their locations; the bottom layer aims at balancing the intra-cluster load distribution with adaptation to local traffic variations by using a DRL-based algorithm. Such a self-organized mechanism breaks the large-scale load balancing problem into smaller pieces that are easier to be handled distributively, thereby improving the system scalability and cost-efficiency.

  • Algorithm: First, this work proposes an off-policy DRL-based algorithm for MLB, which approximates the MLB policy with deep neural networks to largely improve the model generalization capability over complex UDN environments. The algorithm can be trained under an asynchronous parallel learning framework to autonomously learn the optimal MLB policy without any prior knowledge over the underlying wireless environments. Second, the proposed algorithm enables joint exploration with multiple behavior policies, such that existing MLB methods could be used to guide the learning process thereby improving the learning efficiency and stability at the early stage. Empirical results verify that the proposed DRL-based MLB model can adapt to general UDN environments featured with irregular network deployment, coupled interferences, and random user mobility patterns, and outperforms existing MLB methods considerably.

  • Mechanism: This work proposes an offline-evaluation based safeguard mechanism to improve the online control performance in practical systems. Specifically, the proposed mechanism runs an online branch for system control and an offline branch for policy learning in parallel. The online branch always uses the best MLB policy well-trained by the offline branch. As such, the mechanism can avoid the risk of exhibiting unstable or even destructive behaviors when following the under-trained policy at the early learning stage. Moreover, such a mechanism also enables policy exploration in a safe way, which is of vital importance to make full use of machine learning to go beyond existing methods for superior alternatives.

Indeed, the proposed architecture, algorithm and mechanism form a general autonomous and intelligent network control framework, which is also promising to solve other large-scale network control problems in the future systems by changing the learning context.

The remainder of this paper is organized as follows. Section II introduces the two-layer MLB architecture and the system model. Section III presents the centralized clustering algorithm. Section IV presents the self-organized DRL-based MLB algorithm. Section V presents the safeguard mechanism for online control. Section VI discusses the simulation results. Finally, Section VII concludes the paper.

II System Model

II-A Two-layer MLB Architecture for Load Balancing in UDNs

In this paper, we propose a two-layer MLB architecture to control large-scale load balancing for UDNs in a self-organized manner. As shown in Fig. 1, the two-layer architecture includes a centralized load-driven clustering controller at the top layer, and multiple self-organized load balancing controllers at the bottom layer. The top layer aims at grouping all the SBSs into a bunch of clusters according to their historical load levels, repeatedly performed on the order of hours; the bottom layer aims at balancing the load within each cluster, repeatedly performed on the order of seconds up to minutes. In this way, the top layer adapts to the dynamic global traffic fluctuations from a macroscopic view, while the bottom layer tunes the intra-cluster load distribution at a finer level.

The main advantages of the proposed architecture can be summarized as follows.

Fig. 1: A two-layer MLB architecture for UDNs
  1. 1.

    Scalability: The large-scale UDN can now be managed by multiple self-organized controllers, where each of them only needs to work with a small portion of the entire network, i.e., one cluster. As such, each controller only needs to adapt their configurations autonomously based on partial information of the network, which largely reduces the global information exchanges and is more robust to the network topology variation, thus having a strong scalability.

  2. 2.

    Efficiency: The scale of the non-convex load balancing problem increases with the the number of SBSs, which implies an increasing amount of training overhead when using learning-based algorithms. Meanwhile, the global minimum is not easy to obtain. The proposed self-organized paradigm breaks the large problem into smaller pieces that are easier to handle, such that the high computational complexity can be distributed to a bunch of self-organized controllers, which can work in parallel. In this way, the algorithm is expected to find better locally optimal solutions with improved cost-efficiency.

In this paper, we consider a MLB scenario where the SBSs are densely and randomly deployed. The overloaded SBSs handover the mobile users to their corresponding neighboring SBSs for traffic offloading by tuning the CIOs. The distributed SBSs collect the environment data, e.g., SBS loads, and deliver the data to the controllers at the bottom layer. The top layer dynamically groups the SBS into a bunch of clusters by using a load-driven clustering algorithm; the bottom layer performs self-organized intra-cluster load balancing based on partial network information by using a DRL-based learning algorithm. The detailed work-flow is shown in Fig. 2 and can be interpreted as follows. The clustering at the top layer is triggered either periodically or based on certain traffic fluctuations, e.g., the sudden increase of traffic due to some sport events, the daily tidal effect, etc. We refer to the time period between two clustering operations at the top layer as one MLB stage.

Fig. 2: Illustration of the MLB procedure based on the two-layer architecture

First, at the beginning of MLB stage k, the top layer groups all the SBSs into multiple clusters according to their averaged load levels during the previous stage. The clustering result stays unchanged over the stage. Then, the bottom layer performs intra-cluster MLB for each cluster by executing: 1) action selection, i.e., selecting the MLB actions according to the DRL-based MLB policy; 2) system control, i.e., executing the control actions for intra-cluster load offloading; 3) policy improvement, i.e., improving its DRL-based MLB policy by learning. The stage k repeats the above three procedures until the next stage k+1, i.e., until the re-clustering is triggered.

II-B Handover Management

II-B1 Handover

The handover process in MLB aims at transferring a mobile user from its serving SBS to a neighboring SBS if better signal quality can be reached. Specifically, at a given time t, the handover of one mobile user from its serving SBS i to a neighboring SBS j is triggered according to the A3 condition [23] as

Fjt-Fit>Oijt+Hys, (1)

where Fit and Fjt are the user’s reference signal received power (RSRP) of its serving SBS i and the neighboring SBS j, respectively, Hys is the handover hysteresis, which is usually a fixed value to prevent frequent handovers, and Oijt is the CIO between SBS i and SBS j. Note that Oijt is usually symmetrical, i.e., Oijt=-Ojit, to ensure that a user handed over from one SBS to another will not be handed straight back to prevent the ping-pong effect [10].

According to (1), decreasing a proper value of Oijt can trigger user handovers from SBS i to SBS j, thereby offloading the load from SBS i to SBS j, and vise versa. Therefore, the key of MLB is to find the best CIO tuning policy to trigger user handovers optimally.

II-B2 User Throughput

The signal-to-noise-plus-interference ratio (SINR) of user u in SBS i at time t is given by

SINRut=PiGuitN0+j,jiPjGujt, (2)

where is the SBS set, Pi is the transmission power of SBS i, Guit is the channel power gain between user u and SBS i at time t, and N0 is the noise power, which is assumed, without loss of generality, to be the same for all the users. We assume that the smallest resource unit to allocate is the physical resource block (PRB). For a given user u, the maximum transmission rate of one PRB at time t is given by

Rut=Blog2(1+SINRut), (3)

where B is the spectrum bandwidth of one PRB.

II-B3 Load

In this paper, we assume that each user has a constant bit rate (CBR) requirement Mut at time t. The required number of PRBs to meet the demand Mut is

Nut=min{MutRut,Nc}, (4)

where Nc is a constant threshold to keep the number of PRBs occupied by users with poor channel quality to be under a reasonable level. Finally, the load of SBS i, defined as the ratio of users’ required number of PRBs versus the total number of PRBs, can be written as

ρit=u𝒰itNutNip, (5)

where NiP is the total number of PRBs of SBS i, and 𝒰it is the set of users assigned to SBS i at time t.

II-C Preliminaries on Reinforcement Learning

II-C1 Reinforcement Learning

In this paper, we formulate the load balancing problem with RL. Specifically, RL aims at maximizing a cumulative reward by selecting a sequence of optimal actions under different system states in a stochastic unknown environment [24]. The dynamics is usually modeled as a Markov decision process (MDP), which can be characterized by a state space 𝒮, an action space 𝒜, a reward function r:𝒮×𝒜1, and a stationary transition probability satisfying the Markov property p(𝒔t+1|𝒔1,𝒂1,,𝒔t,𝒂t)=p(𝒔t+1|𝒔t,𝒂t), where 𝒔𝒮,𝒂𝒜. Specifically, at each state 𝒔t𝒮, the RL agent selects an action 𝒂t𝒜 by following a policy π to interact with the environment, and receives a reward r(𝒔t,𝒂t); then the state st moves on to 𝒔t+1 and the system dynamics starts the next round.

II-C2 Value Function

In RL, the value function for a given policy π at state 𝒔 is defined to be the received long-term expected cumulative rewards starting at state 𝒔 and following the policy π thereafter [24], i.e.,

Vπ(𝒔)=𝔼π[t=0γtr(𝒔t,𝒂t)|𝒔t=𝒔], (6)

where γ[0,1] is the discount factor. Similarly, the Q-function for a given policy π when choosing action 𝒂 at state 𝒔 is defined as

Qπ(𝒔,𝒂)=𝔼π[t=0γtr(𝒔t,𝒂t)|𝒔t=𝒔,𝒂t=𝒂]. (7)

Generally, the goal of RL is to find the optimal policy π* which maximizes the cumulative discounted rewards from the start state 𝒔0 [24].

II-C3 Off-policy Learning

The learning and exploration in RL usually involve two policies: the target policy and the behavior policy. The target policy is optimized by using the samples generated by following the behavior policy. In practice, the target policy is usually used for system control, while the behavior policy is used for exploration. In the literature, there are two classes of learning methods in RL [24], i.e., the on-policy methods and the off-policy methods. The on-policy methods use the same policy as the target policy and the behavior policy, in other words, using the same policy for both system control and exploration. The off-policy methods, on the other hand, separate the target policy from the behavior policy, where the target policy is optimized by using the samples generated from the behavior policy based on the importance sampling technique [24].

In this paper, we focus on the off-policy RL for MLB with the following motivations. First, it would be problematic or even dangerous to apply the on-policy methods to online systems, since it requires online exploration, i.e., performing random actions for system control in order to generate randomized learning samples. Comparatively, with off-policy learning, we can employ a deterministic policy for online control while following a different stochastic behavior policy to explore learning samples offline, e.g., by using a network evaluation system, thus avoiding the risk of online exploration. Second, the samples generated from different behavior policies or in previous explorations can be reused in the off-policy RL, which is the foundation for our later proposed learning scheme under multiple behavior policies.

II-D Problem Formulation

The clustering at the top layer partitions the large-scale load balancing problem into multiple small intra-cluster load balancing problems, where each load balancing controller at the bottom layer only needs to tune the CIOs of its corresponding intra-cluster SBSs.

In this paper, we model the MLB problem in UDN as a MDP and exploit the off-policy RL to learn the optimal intra-cluster CIO tuning policy. Specifically, the SBS load distribution and user distribution are defined as the state, the change of CIO values is defined as the action, and the load balancing utility is defined as the reward. At each time t, the bottom layer selects the MLB action 𝒂t according to the observed UDN state 𝒔t, which shifts the system to the next state 𝒔t+1 according to the state transition probability p(𝒔t+1|𝒔t,𝒂t). Note that the transition probability is conditioned on both the current UDN state 𝒔t and the MLB action 𝒂t, which indicates that the MLB actions affect not only the immediate MLB reward but also the next UDN state and, through that, all subsequent MLB rewards. Compared with the traditional formulations which usually aim at maximizing the load balancing utility at a specific time slot, the RL-based formulation is more far-sighted, aiming at achieving a balanced load distribution over a long time horizon. Moreover, the RL-based formulation learns from the incrementally generated data samples when interacting with the UDN environment for MLB, thereby bearing a built-in data-driven learning nature. The detailed learning context is defined as follows.

II-D1 States

The inputs of the RL-based learning algorithm are the wireless system status. However, directly using the raw system data as the RL inputs is problematic: first, the state space could be too large to enumerate; second, the high-dimensional raw data is computationally challenging to process and time costing to transmit; third, the raw data may contain too much redundancy to negatively impact the learning performance. Therefore, in this paper, we propose to use a collection of high-level features to represent the wireless system status as the RL inputs. Specifically, the state for each SBS i is composed of: i) the SBS load derived from the averaged load ρ~i=ρi-ρg, where ρg=1Ni=1Nρi with N denoting the number of SBS; ii) the fraction of the edge users Ei. Here we categorize the edge users according to their downlink SINRs to the corresponding serving SBS and the neighboring SBSs. Generally, one SBS with more edge users indicates that its handover is more sensitive to the change of CIOs. In summary, the state vector could be written as

𝒔t=[ρ~1t,ρ~2t,,ρ~Nt,E1t,E2t,,ENt]. (8)

It is noteworthy to point out that the feature selection is not unique; other combinations could be explored in the future.

II-D2 Actions

The output of the RL-based learning algorithm is the CIO values, i.e.,

𝒂t={Oij(t)|i,j}. (9)

Here we have Oij[Omin,Omax],i,j, where Omin and Omax are the predefined lower bound and upper bound, respectively. Note that, in this paper, in order to study a general control problem, we consider the CIO value to be continuous.

II-D3 Rewards

The load distribution could be balanced by minimizing the maximum load (or equivalently maximizing the inverse of the maximum load) of all the SBSs [12], i.e., alleviating the worst case. Hence, we define the reward function to be

r(𝒔t,𝒂t)=1maxiρit. (10)

II-D4 Policy

The policy in RL is usually modeled as a stochastic function π:𝒮𝒫(𝒜), which characterizes the probability of selecting an action at𝒜 at an arbitrary state st𝒮. However, in this paper, we consider a deterministic policy for CIO tuning in practice, which can be parameterized as

πθ:𝒮𝒜, (11)

where θn is the parameter to optimize. Particularly, in this paper, the policy is parameterized by the deep neural network to improve the model generalization capability, which is discussed in Sec. IV-D.

II-D5 Objective

Our goal is to find a CIO tuning policy which can maximize the cumulative discounted MLB reward from the start state s0. The objective function can be written as

Jβ(πθ) =𝒮κβ(s)Vπθ(s)𝑑s (12a)
=𝒮κβ(s)𝒜π(s,a)Qπθ(s,a)𝑑a𝑑s (12b)
=𝒮κβ(s)Qπθ(s,πθ(s))𝑑s (12c)
=𝔼sκβ[k=0γkr(s,πθ(s))], (12d)

where β:𝒮𝒫(𝒜) is a stochastic behavior policy, and

κβ(s):=𝒮t=1γt-1p1(s)p(ss,t,β)ds, (13)

is the discounted state visitation distribution [25] when following the behavior policy β with p1(s) the probability of the starting state and p(ss,t,β) the transition probability from state s to state s under policy β at time t. Generally, the objective can be viewed as the value function or the Q-function of the target policy averaged over the state distribution of the behavior policy [25]. Finally, the off-policy RL-based intra-cluster MLB problem could be written as

𝒫0: maxθJβ(πθ) (14)
s.t. C1:𝒳u,it{0,1},i𝒳u,it1,u𝒰, (15)
C2:Oij[Omin,Omax],i,j, (16)

where 𝒳u,it defines the uniqueness of the user-to-SBS association, C1 defines the uniqueness of user association, and C2 defines the bound over the CIO values.

III Centralized Dynamic Load-Driven Clustering

In our proposed architecture, the top layer aims at dynamically grouping all the SBSs into a bunch of clusters according to the global traffic variations. The clustering can be triggered either periodically or based on certain traffic patterns, e.g., the sudden increase of traffic due to some sport events, the daily tidal-effect, etc. In this paper, we tailor the celebrated Lloyd’s k-means clustering method [26] to perform load-driven clustering for the underlying SBSs, which is fast to converge in practice. The proposed clustering algorithm consists of two phases: 1) load-based initialization, which is based on the detection of traffic hotspots, i.e., the top overloaded SBSs; 2) location-based SBS clustering, which aims at grouping the hotspot SBSs with their nearby SBSs based on the geographical distance.

III-1 Load-based Initialization

The selection of initial cluster centroids can largely influence the clustering result. A commonly used strategy is to randomly choose a bunch of SBSs as the initial centroids [27]. Based on the domain knowledge, in this paper, we instead propose to choose the top k overloaded SBSs as the initial centroids in order to efficiently offload the traffic from the hotspot SBSs for load balancing. First, we define the stage-averaged load to be the averaged SBS load over the previous MLB stage. Specifically, for each SBS i, we define its stage-averaged load over the previous stage k-1 as

ρ^ik-1=1Tk-1t=tk-10tk-10+Tk-1-1ρit, (17)

where tk-10 is the start time of stage k-1, and Tk-1 is the time length of stage k-1. The stage-averaged load can reflect a fair load level by reducing the influence from abnormal traffic variations. Next, we rank the SBSs according to their stage-averaged loads to obtain a sorted list 𝑪list as the initial cluster centroids.

III-2 Location-based Clustering

Given the list of candidate centroids from the initialization phase, we next group the SBSs according to their geographical locations by minimizing the sum of squared error (SSE):

min𝒞h,𝒄hh=1H𝒙i𝒞h𝒙i-𝒄h22 (18)
s.t. 𝒳i,ht{0,1},h𝒳i,ht1,i𝒞h, (19)

where H is the predetermined number of clusters, 𝒞h is the set of SBSs that are assigned to the cluster h, 𝒙i is the location of SBS i, 𝒳i,ht defines the uniqueness of the SBS-to-cluster association, and

𝒄h=1|𝒞h|𝒙i𝒞h𝒙i, (20)

is the centroid location of cluster 𝒞k with || the cardinality. The detailed procedure is provided in Algorithm III-2.

Note that the selection of H is also critical. The clustering validation methods, e.g., the Calinski-Harabasz index [28], can be employed to evaluate all the possible H and choose the best one. For example, we can first set a searching range for the candidate H, e.g., ={1,2,,Hmax} where Hmax is the upper bound; then, we evaluate the Calinski-Harabasz index of the clustering result with each h{1,2,,Hmax}; finally, we choose the one with the highest Calinski-Harabasz index as the optimal choice.

It is noteworthy to remark that our proposed k-means algorithm is only one possible solution for the dynamic clustering at the top layer; other explorations could be done in the future with different preferences, e.g., by considering user mobility pattern, social relationships, radio propagation, etc. {algorithm} Load-driven Clustering Based on K-means \[email protected]@algorithmic[1] \STATEInput: \STATE t; ρit; 𝒙i; number of clusters H; number of SBSs N. \STATEInitialization: \STATE Calculate the stage-averaged loads according to (17). \STATE Rank the stage-averaged loads to obtain 𝑪list. \STATE Initialize the cluster centroids with the top H elements from 𝑪list. \STATEIteration: \REPEAT\FORi=1,2,,N \FORh=1,2,,H \STATE Update the SBS-to-cluster association by

𝒳i,ht=argmin𝒙i-𝒄h22
\ENDFOR\ENDFOR\FOR

h=1,2,,H \STATEUpdate the cluster centroids by

𝒄h=i=1N𝒳i,ht𝒙ii=1N𝒳i,ht
\ENDFOR\UNTIL

No change of the cluster centroids.

IV Self-Organized DRL-Based Load Balancing

The bottom layer aims at balancing the load within each cluster with the proposed off-policy DRL-based MLB algorithm. Generally, the classic off-policy method optimizes the target policy by following a single behavior policy for exploration. However, it is typical to assume that the behavior policy must allow the RL agent to explore every possible state infinitely often [24]. To this end, the commonly adopted behavior policy should be a fully randomized one, such that the explored high-quality samples are usually very sparse, which makes the learning inefficient. Therefore, in this paper, we propose to adopt multiple behavior policies for joint exploration, which covers the use of a single behavior policy as a special case, and has the following advantages.

  1. 1.

    Stability: The learning now only requires the jointly explored space of multiple behavior policies to cover the entire state-space. Hence, the learning over multiple behavior policies can perform well even though none of the behavior policies can lead to the optimal solution on its own.

  2. 2.

    Efficiency: The explored samples by following different behavior policies are richer than simply following one behavior policy, which implies a higher probability to find high-quality samples. Moreover, the behavior policies could include traditional MLB methods to directly generate high-quality samples to realize expert-guided or model-assisted learning, which could improve the learning efficiency.

Note that the learning under multiple behavior policies can be implemented for practical systems in many ways. For example, 1) the self-organized cluster could use different behavior policies in turn, e.g., using traditional MLB methods to guide learning at the beginning and switching to random exploration later; 2) different self-organized clusters under similar MDPs, e.g., due to similar cell layouts and user mobility patterns, can learn jointly, where each of them follows a different behavior policy and updates a shared target policy.11 1 In this case, the learning under multiple behavior policies can be considered as a simple multi-agent learning paradigm.; 3) the target policy can be trained offline, e.g., in a simulation platform, by following multiple behavior policies, and then be used for online control.

In the sequel, we first extend the policy gradient theorem for the case with multiple behavior policies, which is the key for policy improvement; then we present a parallel learning framework. Finally, we empower the learning framework with deep neural networks to largely improve its generalization capability in complex UDN environments.

IV-A Policy Gradient with Multiple Behavior Policies

IV-A1 Preliminaries on Policy Gradient

Traditional policy improvement methods find a better policy by greedily choosing the best action under all possible states according to the estimated value function or Q-function [24]. However, such a greedy searching strategy becomes computationally demanding when the action space is large, and usually impossible for a continuous action space. Therefore, in this paper, we exploit the policy gradient method for policy improvement, which overcomes the limitations of the greedy searching strategy by explicitly optimizing a parameterized policy. Specifically, the policy gradient theorem [24] gives the analytic expression for the gradient of the objective J(π𝜽) with respect to the policy parameters 𝜽, written as

𝜽J(π𝜽)=𝔼𝒔dπ𝜽,𝒂π𝜽[𝜽logπ𝜽(𝒂|𝒔)Qπ𝜽(𝒔,𝒂)], (21)

where dπ𝜽 is the state distribution when following policy π𝜽. Later, Silver et al. proposed the off-policy deterministic policy gradient (OPDPG) theorem [25] to optimize a deterministic target policy by following a single stochastic behavior policy, written as

θJβ(πθ) 𝒮ρβ(s)θπθ(s)aQπ(s,a)|a=πθ(s)ds (22a)
=𝔼sρβ[θπθ(s)aQπ(s,a)|a=πθ(s)], (22b)

where the approximation comes from the drop of one term depending on the Q-value gradient θQπ(s,a) to preserve the convergence [25].

IV-B OPDPG with Multiple Behavior Policies

The OPDPG theorem can be adopted to optimize our objective function with a proper extension on the learning over multiple behavior policies. Specifically, we denote the set of multiple behavior policies to follow as ={β1,β2,,βM}, where M is the total number of behavior policies. Based on the objective under a single behavior policy given in (12), the objective under multiple behavior policies could be written as

J(πθ)=mJβm(πθ), (23)

where

Jβm(πθ)=𝔼sκβm[k=0γkr(s,πθ(s))], (24)

with κβm(s):=𝒮t=1γt-1p1(s)p(ss,t,β)ds. The new objective J(πθ) can be viewed as the value function of the target policy averaged over the states generated by following multiple behavior policies. The intra-cluster MLB problem now becomes

𝒫1: maxθJ(πθ)=mJβm(πθ) (25)
s.t. C1:𝒳u,it{0,1},i𝒳u,it1,u𝒰,
C2:Oij[Omin,Omax],i,j.

Accordingly, the policy gradient for the learning under multiple behavior policies is given as

θJ(πθ)=mθJβm(πθ) (26a)
m𝒮ρβm(s)θπθ(s)aQπ(s,a)|a=πθ(s)ds (26b)
=m𝔼sρβm[θπθ(s)aQπ(s,a)|a=πθ(s)], (26c)

where in (26b), we apply the OPDPG theorem for each behavior policy, respectively.

Fig. 3: Overview of the DRL architecture for MLB
(a) Actor-Critic
(b) Explore With Multiple Behavior Policies
Fig. 4: Parallel learning framework

IV-C Parallel Learning Framework

We now develop a parallel learning framework based on the actor-critic algorithm [24], where each agent follows a different behavior policy for exploration, while updating a shared target policy simultaneously.

IV-C1 Preliminaries on Actor-Critic

The actor-critic algorithm [24], which comprises an actor and a critic learner, is one of the most popular learning algorithms based on the policy gradient methods. As shown in Fig. 4(a), the critic estimates the true Q-function Qπ𝜽(𝒔,𝒂) of the actor’s policy π𝜽 with a parameterized function Q𝒘(𝒔,𝒂), where 𝒘n is the parameters to be learned based on the policy evaluation methods [24]. The actor, on the other hand, improves its policy π𝜽 with the estimated Q-function Q𝒘(s,a) based on the policy gradient methods. In this paper, we use Q-learning for policy evaluation in the critic, while using the extended OPDPG in (26) for the policy gradient in the actor. The critic and the actor will iteratively improve the estimated Q-function Q𝒘(𝒔,𝒂) and the policy π𝜽 until convergence.

IV-C2 Parallel Learning Framework with Multiple Behavior Policies

As shown in Fig. 4(b), we employ multiple actor-critic RL agents to learn in parallel. Generally, each RL agent interacts with the environment by using a different behavior policy and transmits the calculated gradients to a centralized parameter server. The parameter server uses the local gradients to update a set of global parameters and periodically synchronize with the local parameters for consensus. The above procedures are performed repeatedly until convergence or some termination criteria are met. In particular, the detailed work-flow for the iteration at time t is presented as follows.

First, for each agent m at time t, the local actor observes the current state st, and executes an action atm by following its own behavior policy with atm=βm(stm). Afterwards, the current state stm shifts to st+1m and returns a reward rtm. The generated transition at time t can be written as (stm,atm,st+1m,rtm).

Second, each local critic evaluates the Q-function of the target policy based on Q-learning [24]. Specifically, the gradient Δwm for the Q-function updates can be written as

δtm =rtm+γQw(st+1m,πθ(st+1m))-Qw(stm,atm), (27)
Δwm =δtmwQw(stm,atm), (28)

where δtm is the temporal difference (TD) error, with πθ(st+1m) the action selected by the target policy πθ at state st+1m. Note that each local critic only needs to submit the gradient to the parameter server, without applying to the local Q-function.

Third, each local actor computes the policy gradient according to the extended OPDPG theorem in (26). The gradient Δθm for the target policy updates can be written as

Δθm=θπθ(stm)aQw(stm,a)|a=πθ(stm). (29)

Similarly, each local actor only needs to submit the gradient to the parameter server, without applying to the local policy.

Finally, the parameter server uses the collected gradients to update a set of global parameters as

wt+1g =wtg+m=1MαwΔwm, (30)
θt+1g =θtg+m=1MαθΔθm, (31)

where αw and αθ are the step-sizes. Afterwards, the local parameters wtm and θtm of all the agents m (including all the actors and critics) are synchronized with the global parameters wt+1g and θt+1g for consensus, which completes this iteration.

IV-C3 Convergence Analysis

Suppose that p(s|s,a), ap(s|s,a), μθ, θμθ(s), r(s,a), ar(s,a), p1(s) are continuous in variables s,a and s, and that the state space 𝒮 is a compact subset of d, the convergence of the proposed parallel actor-critic learning framework can be reached when we use: i) compatible linear Q-function approximators [25]; ii) the gradient temporal-difference (GTD) based learning method [29] for the critic updates.

The justification follows the one in [25]. First, the use of a compatible linear Q-function approximator ensures that the deterministic policy gradient will not be affected when using the estimated gradient aQw(s,a) to replace the true gradient aQμ(s,a). In particular, a function approximator that is compatible with the deterministic policy μθ(s) is defined to satisfy [25]:

  1. 1.

    aQw(s,a)|a=μθ(s)=θμθ(s)w;

  2. 2.

    w minimizes the mean-square error MSE(θ,w)=𝔼sρβm[ϵ(s;θ,w)ϵ(s;θ,w)], where ϵ(s;θ,w)=aQw(s,a)|a=μθ(s)-aQμ(s,a)|a=μθ(s).

Therefore, we have

wϵ(s;θ,w)=waQw(s,a)|a=μθ(s)=θμθ(s). (32)

If w can minimize the MSE, the gradient of MSE w.r.t. w should satisfy

wMSE(θ,w) =2𝔼sρβm[wϵ(s;θ,w)ϵ(s;θ,w)] (33a)
=2𝔼sρβm[θμθ(s)ϵ(s;θ,w)]=0. (33b)

According to the definition of ϵ(s;θ,w), we have

𝔼sρβm[θμθ(s)aQw(s,a)|a=μθ(s)] (34a)
= 𝔼sρβm[θμθ(s)aQμ(s,a)|a=μθ(s)] (34b)
= θJβm(μθ), (34c)

which proves that the estimated gradient aQw(s,a) can replace the true gradient aQμ(s,a) without affecting the deterministic policy gradient. Note that the above proof applies to any of the multiple behavior policies βm.

Second, the GTD based policy evaluation has been proven to have sure convergence for the case with multiple behavior policies when using true gradients [29], which herein is also applicable when using estimated gradients based on the above proof. Hence, we can use the extended OPDPG theorem in (26) for the policy gradient, and use the diffusion off-policy GTD in [29] for the policy evaluation to ensure the convergence. Note that, according to [29], the actor and the critic should be updated with a sufficiently small step-size, in order to minimize the mean-squared projected Bellman error (MSPBE).

In order to achieve a better generalization capability, in the following part, we further apply the nonlinear deep neural networks as the function approximators; and use Q-learning [24] for critic updates, instead of the diffusion off-policy GTD [29], to reduce the training complexity. Consequently, the convergence is no longer theoretically guaranteed, which is a common issue for deep learning based models. But the empirical results show that our proposed parallel actor-critic learning framework with deep neural networks can still converge in practice.

IV-D Fueled With Deep Neural Networks

{algorithm}

DRL-based MLB With Parallel Learning Framework \[email protected]@algorithmic[1] \STATEInput: \STATEfor m, αwm=10-3, αβm=10-4; γ=0.99; τ=0.001; K=64. \STATEInitialization: \STATERandom initialize the weights w and θ; for m, set wmw,θmθ, w^mwm,θ^mθm. \STATEIteration: \STATEObserve the start state s1m. \FORt=1, \FORm=1,M \STATESynchronize wmw,θmθ. \STATEUpdate the target networks with (37) and (38). \STATEObserve stm and execute atm=βm(at|st). \STATEReceive rtm and shift to the next state st+1m. \STATEStore the transition (stm,atm,rtm,st+1m) in 𝒟m. \STATERandomly select K transitions from 𝒟m. \STATECalculate the local Δwm for critic with (39). \STATECalculate the local Δθm for actor with (40). \STATESend Δwm and Δθm to the parameter server. \ENDFOR\STATEUpdate the global w and θ with (30) and (31). \ENDFOR

Theoretical convergence and sample-complexity analysis in RL have been mainly developed when using tabular or linear functions to approximate the Q-functions [24]. The use of non-linear function approximators, e.g., neural networks, has long been considered to be unstable or even leading to divergence in RL due to the strong correlations among the generated samples [30, 31, 32, 33]. However, recent successes on DRL have proven that using deep neural networks as function approximators is feasible when adopting the following two learning strategies [30]: 1) using mini-batch learning along with an experience replay, which can randomize the generated data samples and smooth the changes over data distributions; 2) using guiding Q-networks, which can reduce the correlation between the learning Q-value Q(s,a) and the guiding Q-value r(st,at)+γQ(st+1,at+1), when deriving the gradients. In the sequel, we empower the proposed algorithm with the powerful deep neural networks, which are trained based on the aforementioned two learning strategies.

IV-D1 Deep Architecture

We use two deep neural networks to respectively represent the policy function in the actor and the Q-function in the critic to improve the generalization capability for MLB, which are referred to as the actor-net and critic-net, respectively. The high-level structure is presented in Fig. 3. Specifically, for the actor-net, the policy network takes the system states (the high-level features defined in Sec. II-D) as the inputs, and outputs the CIOs. For the critic-net, the Q-network takes both the system states and the output CIOs from the policy network as the inputs, and generates a scalar Q-value as the final output. The neurons are fully connected.

Fig. 5: Safeguard mechanism for online control

IV-D2 Guiding Networks

We use the guiding networks to stabilize the learning process in the actor and the critic. Specifically, we introduce two more deep neural networks with the same structure as the critic-net and actor-net, respectively, to be the guiding critic-net Qw^m(s,a) and the guiding actor-net πθ^m(s). The aim of the guiding network is to provide a stabilized learning target when updating the critics. In particular, according to the critic update rules in (27) and (28), the loss function to minimize is given by

(wm)=𝔼(si,ai,ri,si+1)𝒟m[(yim-Qwm(sim,aim))2], (35)

where

yim=r(sim,aim)+γQw^m(si+1m,πθ^m(si+1m)). (36)

Here yim serves as the learning target for the critic-net. However, different from (27) and (28), the estimated Q-value in yim is now generated by the guiding critic-net, with the action predicted by the guiding actor-net, i.e., Qw^m(si+1m,πθ^m(si+1m)). The parameters of the guiding networks should slowly track the parameters of the learning networks, such that yim becomes a stabilized learning target for the critic-net. In this way, the training of the critic-net is closer to a supervised learning problem, where a robust solution likely exists. In practice, the guiding parameters w^m and θ^m can be updated as

w^mt =τwmt+(1-τ)w^mt, (37)
θ^mt =τθmt+(1-τ)θ^mt, (38)

where τ1 is a fixed step-size.

IV-D3 Mini-batch Learning

We use the mini-batch gradient descent to train the deep neural networks, which is a widely used technique in deep learning. Generally, in mini-batch learning, the training dataset are first split into multiple mini-batches of certain size. The gradients over one mini-batch are summed or averaged to reduce the variance of the gradient. In this way, the mini-batch gradient descent is promising to find a balance between the efficiency of stochastic gradient descent and the robustness of full-batch gradient descent. In our model, each local agent m has a local experience replay 𝒟m to store the generated transitions (stm,atm,rtm,st+1m) incrementally. At each time t, for each local agent m, we select K transitions randomly from its local experience replay to form one mini-batch. The gradients to minimize the loss function (wm) are averaged as

Δwm=1Ki=1K[(yim-Qwm(sim,aim))wmQwm(sim,aim)]. (39)

Similarly, the gradients to update the target policy are averaged as

Δθm=1Ki=1Kαθθmπθm(sim)aQwm(sim,a)|a=πθ(sim). (40)

IV-D4 Asynchronous Update

Based on the proposed parallel actor-critic learning framework, we use a parameter server to collect the mini-batch gradients Δwm and Δθm from the local agents to update a set of global parameters, and synchronize the local w and θ of each agent with the global one for consensus. Note that the training of the deep neural networks can be performed in an asynchronous manner [32, 33], which relaxes the coordination requirements. However, the received gradients exceeding a preset maximum time delay should be filtered out to ensure the learning stability [32, 33].

The complete procedure of our proposed DRL-based intra-cluster MLB is presented in Algorithm IV-D.

V Offline-Evaluation Based Online Control

One of the major drawbacks of using the data-driven machine learning based method for online control is that the system needs to take the risk of applying under-trained policies for control at the early learning stage, which is unstable and even dangerous to the network, e.g., following an under-trained MLB policy may lead to heavily overloaded situations due to random handovers. On the other hand, training the neural networks to re-adapt to the underlying UDN environments after each re-clustering process requires a certain amount of time, which may disturb the online MLB performance. Therefore, in this section, we propose an offline-evaluation based safeguard mechanism which runs an online branch for system control and an offline branch for policy learning in parallel. The online branch always uses the best MLB policy found and well-trained by the offline branch. As such, the mechanism not only stabilizes the online performance, but also enables the exploration beyond-current or even over unexpected policies in a safe way. The philosophy behind is of vital importance to make full use of data-driven machine learning, which goes beyond existing methods to deliver superior alternatives.

Specifically, we run the proposed MLB algorithm over the offline learning branch and the online controlling branch in parallel. The learning branch runs in an offline manner, e.g., by using a network evaluation system based on historical performance records; the controlling branch runs in the online system, which only uses the best MLB policy found by the offline branch. For example, as shown in Fig. (5), the work-flow at stage k can be interpreted as follows. First, at the beginning of stage k, the offline learning branch triggers the re-clustering process, and trains the neural networks to adapt to the new clustering result. Then, at the end of stage k, the safeguard evaluates the performance of the newly learned MLB policy over the previous stage k. If the newly learned policy can generate a better performance, the safeguard would replace the online policy with the new policy; otherwise, it keeps the current online policy in the online branch and continues to seek a better one in the offline branch at the next stage. In this way, we can keep exploring better MLB policies in the offline branch, while only executing the best and well-trained policy online, thus ensuring the online system to operate with a stable MLB policy.

VI Simulation Results

In this section, we evaluate the performance of the proposed two-layer MLB architecture along with the DRL-based MLB algorithm through simulations. We choose four baseline methods: 1) the static rule-based algorithm [10], which uses a fixed step-size to tune the CIOs; 2) the adaptive rule-based algorithm [11], which uses an adaptive step-size to tune CIOs; 3) the Q-learning based algorithm [17], which uses Q-learning to tune the CIOs between the most overloaded SBS and its neighboring SBSs; 4) a plain baseline without any MLB controls. We refer to them as the rule-based (static) algorithm, the rule-based (adaptive) algorithm, the Q-learning algorithm, and the noMLB algorithm, respectively.

The simulated SON scenario consists of 12 SBSs randomly distributed in a 300m×300m area with 200 users randomly walking at 1m/s10m/s, each incurring a CBR traffic demand. The transmit power of each SBS is set to be 46dBm. The path loss from each particular user to one particular SBS is modeled as 128.1+37.6log(max{d,0.035}), where d is the distance in km. The extra log-normal shadowing is modeled with a zero mean and a standard deviation of 8dB. The handover hysteresis is set to be 3dB. Both of the actor-net and critic-net are composed with two hidden layers, each with 400 and 300 neurons, respectively. All hidden layers are followed by a rectified non-linearity layer. The learning rate for actor and critic networks are fixed as 10-4 and 10-3, respectively. The discount factor γ for rewards is set to be 0.99. The τ value for the soft target network update is set to be 0.001. The size of the mini-batch is set to be 64. The size of each local experience replay is set to be 105. The simulations are performed on a workstation with eight Intel Xeon E3 CPU cores at 3.50 GHz and NVIDA GeForce GTX 970 with 4G graphics memory, Generally, the training of one DRL model with 10,000 time steps can usually be completed within 30 minutes.

We consider three MLB performance indicators: 1) the reward function value; 2) the load standard deviation; 3) the handover failure ratio (HFR). Specifically, the HFR is obtained by introducing an admission control mechanism [10, 11] to block the incoming handover attempts if the SBS load exceeds 80%, concretely,

F=NHOfailNHOfail+NHOsuccess, (41)

where NHOfail is the number of blocked handover attempts of all the SBSs and NHOsuccess is the number of successful handover attempts of all the SBSs. Moreover, the presented rewards are moving averaged to smooth out short-term fluctuations and highlight longer-term trends, thus presenting a clear and sharp comparison. Specifically, the averaged reward at time t is given by

r¯(t)=1TAi=t-TA+1tr(i), (42)

where TA is a fixed subset size, set to be 200 in this paper.

(a) Normalized performance gain
(b) Averaged rewards with six SBSs
Fig. 6: Performance under different number of SBSs

The result in Fig. 6 presents the scalability of the comparing schemes under the CBR of 96kbps. The result is averaged over 10,000 time steps under 30 different randomized SBS layouts to present a fair and general comparison. Fig. 6(a) presents the performance gain by using the noMLB method as the reward baseline. The result shows that that when performing MLB among only three SBSs, the centralized architecture generates a better performance due to global optimization. However, when performing MLB among more than three SBSs, the two-layer architecture outperforms the centralized architecture with a growing superiority. This verifies the scalability of our proposed two-layer architecture, and validates that the self-organized MLB is promising to find a better (local optimal) solution than the fully centralized MLB by breaking the large non-convex MLB problem into smaller pieces that are easier to handle, as discussed in Sec. II-A. Meanwhile, the rule-based (static) and the rule-based (adaptive) algorithms are also scalable since they are both fully distributed algorithms. However, their performance is worse than our proposed algorithm due to the lack of cooperations. Fig. 6(b) presents the detailed averaged rewards obtained with six cells, which reveals that compared with the centralized MLB, the performance of the self-organized MLB is more stable and can converge remarkably faster at the early stage.

Fig. 7: Performance of learning under multiple behavior policies

The result in Fig. 7 shows the moving averaged rewards of all the comparing schemes over 4,000 time steps under a CBR requirement of 112 kbps. The performance is averaged over 30 different and random SBS topologies. The result shows that the proposed DRL-based algorithms can outperform the competitors considerably. Specifically, for our proposed DRL-based algorithm, we study two alternatives: 1) using a single behavior policy that is constructed by adding certain random noises to the target policy; 2) using three behavior policies, including the above noisy policy and two rule-based policies. We refer to them as the DRL-SBP algorithm and the DRL-MBP algorithm, respectively. The DRL-SBP algorithm is trained centrally, while the DRL-MBP algorithm is trained in parallel. The noMLB baseline in Fig. 7 indicates that the maximum SBS load fluctuates around 74% if no MLB attempts are made. For the rule-based algorithms, both of the rule-based (static) and rule-based (adaptive) algorithms can control the maximum SBS load to be under 67%. However, the rule-based (adaptive) algorithm can perform slightly better than the rule-based (static) one. For the Q-learning algorithm, due to the quantization of states and actions, the learned policy can hardly be applied to general UDN environments, which results in the same performance as the noMLB baseline. For our proposed DRL-based algorithm, both of the DRL-SBP algorithm and the DRL-MBP algorithm can control the maximum SBS load to be under 55%, where the DRL-MBP algorithm can converge remarkably faster at the early stage and performs slightly better after the convergence.

(a) Reward Without Safeguard
(b) Reward With Safeguard
Fig. 8: Performance of the safeguard mechanism

The result in Fig. 8 shows the rewards obtained by using the proposed safeguard mechanism. The simulation is performed over 50,000 time steps, where the re-clustering is simply triggered with a period of 10,000 time steps, i.e., about 3 hours. Fig. 8(a) shows the performance achieved without using the safeguard mechanism. It is clear that the re-clustering operation has a strong impact over the MLB performance. Fig. 8(b) shows the performance achieved by using the safeguard mechanism, which can be interpreted as follows. First, at the beginning of the first stage, the offline branch and online branch are initialized with the same parameters and start to learn. Next, at the beginning of the second stage, the offline branch triggers re-clustering, and starts to explore a new MLB policy. However, the new policy is worse than the online policy, which is therefore discarded. Then, at the beginning of the third stage, the offline branch starts a new exploration and successfully finds a new policy that is better than the the online policy. Hence, the safeguard replace the online policy with the new policy along with the well-trained parameters at the beginning of the forth stage. In this way, the MLB performance of the online branch is ensured to be stable and non-decreasing as presented in Fig. 8(b).

(a) Handover Failure Rate
(b) Load Standard Deviation
Fig. 9: Global load distribution

The result in Fig. 9 reflects the global load distribution over the entire area. Specifically, Fig. 9(a) presents the HFR under different load burdens by varying the CBR from 48kbps to 112kbps. Generally, a lower HFR means that the SBSs are less likely to be overloaded. The result in Fig. 9(a) shows that the HFO gradually increases with the increase of offered loads for all schemes. However, the proposed DRL-based algorithm can always achieve a lower HFR compared with the other methods. The result in Fig. 9(b) shows the load standard deviations under different load burdens. Generally, a lower load standard deviation means that the loads are more equally distributed among SBSs. The result in Fig. 9(b) shows that the proposed DRL-based algorithms can achieve low standard deviations as well. Additionally, the superiority of our proposed algorithms increases along with the load burdens. The HFR and load standard deviation results jointly verify that our proposed algorithm can achieve a more balanced global load distribution than the competitors.

VII Conclusion

In this paper, we proposed a DRL-based MLB algorithm along with a two-layer MLB architecture to solve the large-scale load balancing issue in UDNs. The proposed two-layer architecture can handle the large-scale MLB problem of UDNs in a self-organized manner, which makes it scalable to the number of SBSs. The proposed off-policy DRL-based algorithm can be trained via an asynchronous parallel learning framework and employ multiple behavior policies for joint exploration to improve the learning efficiency. Moreover, we proposed an off-policy evaluation based safeguard mechanism to improve the online control performance by ensuring that the online UDN system always operate with the optimal and well-trained MLB policy. Simulation results verified that 1) the proposed algorithm could outperform existing algorithms considerably, in terms of the MLB performance; 2) the proposed two-layer architecture achieves nice scalability over the number of SBSs, and outperforms the centralized architecture when dealing with a large number of SBSs; 3) the learning under multiple behavior policies converges remarkably faster than the learning under a single behavior policy; 4) the safeguard mechanism ensures the online performace to be stable and non-decreasing. Moreover, the proposed architecture, algorithm and mechanism are also promising to be applied over other large-scale network control problems in the future networks. For example, for the access control problem in the IoT system studied in [34], one could first cluster the IoT devices into different groups and then determine their access control policies respectively. Meanwhile, the strategy of learning under multiple behavior policies and the safeguard mechanism can be exploited to further improve the learning efficiency and the online performance. More interesting applications can be explored in the future.

References

  • [1] C. Pan, M. Elkashlan, J. Wang, J. Yuan, and L. Hanzo, “User-centric C-RAN architecture for ultra-dense 5G networks: Challenges and methodologies,” IEEE Commun. Mag., vol. 56, no. 6, pp. 14–20, June 2018.
  • [2] Y. Li, Y. Zhang, K. Luo, T. Jiang, Z. Li, and W. Peng, “Ultra-dense hetnets meet big data: Green frameworks, techniques, and approaches,” IEEE Commun. Mag., vol. 56, no. 6, pp. 56–63, June 2018.
  • [3] Y. Zhong, X. Ge, H. H. Yang, T. Han, and Q. Li, “Traffic matching in 5G ultra-dense networks,” IEEE Commun. Mag., vol. 56, no. 8, pp. 100–105, August 2018.
  • [4] H. Zhang, L. Song, and Y. J. Zhang, “Load balancing for 5G ultra-dense networks using device-to-device communications,” IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 4039–4050, June 2018.
  • [5] I. Siomina and D. Yuan, “Optimization of pilot power for load balancing in WCDMA networks,” in IEEE Global Communications Conference (GLOBECOM), Dallas, Texas, USA, November 2004, pp. 3872–3876.
  • [6] A. Wang and V. Krishnamurthy, “Mobility enhanced smart antenna adaptive sectoring for uplink capacity maximization in CDMA cellular network,” IEEE Trans. Commun., vol. 56, no. 5, pp. 743–753, May 2008.
  • [7] M. Qin, Q. Yang, N. Cheng, H. Zhou, R. Rao, and X. Shen, “Machine learning aided context-aware self-healing management for ultra dense networks with QoS provisions,” IEEE Trans. Veh. Technol., October 2018, to appear.
  • [8] Z. Du, Y. Sun, W. Guo, Y. Xu, Q. Wu, and J. Zhang, “Data-driven deployment and cooperative self-organization in ultra-dense small cell networks,” IEEE Access, vol. 6, no. 1, pp. 22 839–22 848, April 2018.
  • [9] M. Sheng, C. Yang, Y. Zhang, and J. Li, “Zone-based load balancing in LTE self-optimizing networks: A game-theoretic approach,” IEEE Trans. Veh. Technol., vol. 63, no. 6, pp. 2916–2925, July 2014.
  • [10] R. Kwan, R. Arnott, R. Paterson, and R. Trivisonno, “On mobility load balancing for LTE systems,” in IEEE Vehicular Technology Conference (VTC Fall), Taipei, Taiwan, September 2010, pp. 1–5.
  • [11] Y. Yang, P. Li, X. Chen, and W. Wang, “A high-efficient algorithm of mobile load balancing in LTE system,” in IEEE Vehicular Technology Conference (VTC Fall), Yokohama, Japan, September 2012, pp. 1–5.
  • [12] J. Park, Y. Kim, and J. Lee, “Mobility load balancing method for self-organizing wireless networks inspired by synchronization and matching with preferences,” IEEE Trans. Veh. Technol., vol. 67, no. 3, pp. 2594–2606, March 2018.
  • [13] S. Bi, R. Zhang, Z. Ding, and S. Cui, “Wireless communications in the era of big data,” IEEE Trans. Wireless Commun., vol. 53, no. 10, pp. 190–199, October 2015.
  • [14] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, “Machine learning for wireless networks with artificial intelligence: A tutorial on neural networks,” arXiv preprint:1710.02913, October 2017. [Online]. Available: https://arxiv.org/abs/1710.02913
  • [15] W. Xu, Y. Xu, C. H. Lee, Z. Feng, P. Zhang, and J. Lin, “Data-cognition-empowered intelligent wireless networks: Data, utilities, cognition brain, and architecture,” IEEE Wireless Commun., vol. 25, no. 1, pp. 56–63, February 2018.
  • [16] P. Muñoz, R. Barco, J. M. Ruiz-Avilés, I. D. L. Bandera, and A. Aguilar, “Fuzzy rule-based reinforcement learning for load balancing techniques in enterprise LTE femtocells,” IEEE Trans. Veh. Technol., vol. 62, no. 5, pp. 1962–1973, June 2013.
  • [17] S. S. Mwanje, L. C. Schmelz, and A. Mitschele-Thiel, “Cognitive cellular networks: A Q-learning framework for self-organizing networks,” IEEE Trans. Netw. Service Manag., vol. 13, no. 1, pp. 85–98, March 2016.
  • [18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, d. D. G. Van, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, January 2016.
  • [19] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, October 2017.
  • [20] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, “Applications of deep learning and reinforcement learning to biological data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2063–2079, June 2018.
  • [21] Z. Wang, L. Li, Y. Xu, H. Tian, and S. Cui, “Handover control in wireless systems via asynchronous multi-user deep reinforcement learning,” IEEE Internet Things J., vol. 5, no. 6, pp. 4296–4307, June 2018.
  • [22] Y. Xu, W. Xu, Z. Wang, J. Lin, and S. Cui, “Deep reinforcement learning based mobility load balancing under multiple behavior policies,” in IEEE International Conference on Communications (ICC), Shanghai, China, May 2019, to appear.
  • [23] 3GPP, “TS 36.331 Evolved universal terrestrial radio access (E-UTRAN); Radio resource control (RRC); Protocol specification,” Tech. Rep. Release 8, July 2009.
  • [24] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [25] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International Conference on Machine Learning (ICML), Beijing, China, June 2014, pp. 387–395.
  • [26] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, March 1982.
  • [27] M. E. Celebi, H. A. Kingravi, and P. A. Vela, “A comparative study of efficient initialization methods for the k-means clustering algorithm,” Expert systems with applications, vol. 40, no. 1, pp. 200–210, January 2013.
  • [28] U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, pp. 1650–1654, December 2002.
  • [29] S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed, “Distributed policy evaluation under multiple behavior strategies,” IEEE Trans. Autom. Control, vol. 60, no. 5, pp. 1260–1274, May 2015.
  • [30] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, February 2015.
  • [31] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, May 2016, pp. 1–14.
  • [32] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning (ICML), New York, USA, June 2016, pp. 1928–1937.
  • [33] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. D. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, and S. Petersen, “Massively parallel methods for deep reinforcement learning,” arXiv preprint:1507.04296, July 2015. [Online]. Available: https://arxiv.org/abs/1507.04296
  • [34] M. Chu, H. Li, X. Liao, and S. Cui, “Reinforcement learning-based multiaccess control and battery prediction with energy harvesting in iot systems,” IEEE Internet Things J., vol. 6, no. 2, pp. 2009–2020, April 2019.