Abstract
In this paper, we propose a deep reinforcement learning (DRL) based mobilityload balancing (MLB) algorithm along with a twolayer architecture to solve thelargescale load balancing problem for ultradense networks (UDNs). Ourcontribution is threefold. First, this work proposes a twolayer architectureto solve the largescale load balancing problem in a selforganized manner. Theproposed architecture can alleviate the global traffic variations bydynamically grouping small cells into selforganized clusters according totheir historical loads, and further adapt to local traffic variations throughintracluster load balancing afterwards. Second, for the intracluster loadbalancing, this paper proposes an offpolicy DRLbased MLB algorithm toautonomously learn the optimal MLB policy under an asynchronous parallellearning framework, without any prior knowledge assumed over the underlying UDNenvironments. Moreover, the algorithm enables joint exploration with multiplebehavior policies, such that the traditional MLB methods can be used to guidethe learning process thereby improving the learning efficiency and stability.Third, this work proposes an offlineevaluation based safeguard mechanism toensure that the online system can always operate with the optimal andwelltrained MLB policy, which not only stabilizes the online performance butalso enables the exploration beyond current policies to make full use ofmachine learning in a safe way. Empirical results verify that the proposedframework outperforms the existing MLB methods in general UDN environmentsfeatured with irregular network topologies, coupled interferences, and randomuser movements, in terms of the load balancing performance.
Quick Read (beta)
Load Balancing for UltraDense Networks: A Deep Reinforcement Learning Based Approach
Abstract
In this paper, we propose a deep reinforcement learning (DRL) based mobility load balancing (MLB) algorithm along with a twolayer architecture to solve the largescale load balancing problem for ultradense networks (UDNs). Our contribution is threefold. First, this work proposes a twolayer architecture to solve the largescale load balancing problem in a selforganized manner. The proposed architecture can alleviate the global traffic variations by dynamically grouping small cells into selforganized clusters according to their historical loads, and further adapt to local traffic variations through intracluster load balancing afterwards. Second, for the intracluster load balancing, this paper proposes an offpolicy DRLbased MLB algorithm to autonomously learn the optimal MLB policy under an asynchronous parallel learning framework, without any prior knowledge assumed over the underlying UDN environments. Moreover, the algorithm enables joint exploration with multiple behavior policies, such that the traditional MLB methods can be used to guide the learning process thereby improving the learning efficiency and stability. Third, this work proposes an offlineevaluation based safeguard mechanism to ensure that the online system can always operate with the optimal and welltrained MLB policy, which not only stabilizes the online performance but also enables the exploration beyond current policies to make full use of machine learning in a safe way. Empirical results verify that the proposed framework outperforms the existing MLB methods in general UDN environments featured with irregular network topologies, coupled interferences, and random user movements, in terms of the load balancing performance.
I Introduction
The fifth generation (5G) system is anticipated to meet the explosive growth of mobile data traffic, where the ultradense network (UDN) has been widely recognized as one of the most promising solutions [1, 2]. In UDNs, the small cell base stations (SBSs) are densely deployed to reduce the distance between the mobile users and the access points, such that the link quality could be substantially improved to increase the network capacity. However, such densification exacerbates the issue of traffic fluctuation compared with the system with large cells, which will greatly affect the network performances, e.g., the quality of service (QoS) experienced by users [3, 4]. Therefore, load balancing, which aims at alleviating traffic fluctuation, is becoming increasingly important in UDNs. On the other hand, the ultradense deployment of small cells brings numerous challenges in largescale network control, communications and optimizations, where configuring the network for load balancing in UDNs is different from that in traditional cellular networks [3, 4]. For example, the traditional load balancing in cellular networks through power control [5] or changing radiation patterns [6] may change the association of all the users in the coverage area, which could downgrade the performance due to the severely coupled interference in a UDN with dense and irregular small cell topology. Thus, exploring the appropriate manner for load balancing in 5G UDNs to accommodate its irregular network topologies, complicated interference relationships, and diversified user mobility patterns still remains a challenging research problem.
Recently, selforganizing networks (SONs), which promote autonomic learning and selfmanagement, have already been recognized as an attractive paradigm to realize adaptive management for 5G UDNs in a scalable manner [7, 8]. Therefore, it is promising to incorporate selforganizing functionalities into 5G UDNs to enable adaptive and scalable load balancing. The traditional load balancing technique with selforganizing characteristics is referred to as mobility load balancing (MLB) [9, 10, 11, 12], which usually optimizes the load distribution via tuning a parameter named cell individual offset (CIO) to control the user handovers. Current literature has proposed various MLB methods. For example, the rulebased methods map system states into predefined load balancing rules. Raymond et al. [10] proposed to change the CIOs via the rule with a fixed stepsize based on the load difference between adjacent cells; Yang et al. [11] later proposed to use the rule with an adaptive stepsize to reach load equilibrium more efficiently. However, for rulebased controllers, the predefined rules are expected to cover all possible circumstances, which requires a perfect understanding on both the system and the rules. Gametheoretic methods, on the other hand, model the MLB process as an ongoing game among cells. For example, Sheng et al. [9] modeled the MLB problem as a Cournot game; Park et al. [12] optimized the user assignment and the target cell selection based on the Kuramoto synchronization and matching theory, respectively. However, the equilibrium and optimality of gametheoretic models are usually built upon specific assumptions over the wireless gaming environment.
On the other hand, the emerging paradigm of machine learning and big data is reshaping the future of wireless networks [13, 14, 15, 2]. Reinforcement learning (RL), which is one class of machine learning methods that can adapt to unknown environments by learning from the environmental feedbacks, has already been applied in load balancing [16, 17]. However, the existing RLbased load balancing methods need to first quantize the system states and the control actions to construct a tabular environment [16, 17], which suffers from the curse of dimensionality, i.e., the numbers of actions and states increase exponentially with the degrees of freedom, thus largely hindering their applications to the complex 5G UDN environments. Noticeably, the recent combination of deep learning and RL, coined as DRL, has greatly improved the generalization capability of RL to achieve the stateoftheart performances in many fields, such as competitive gamings [18, 19], biological data analysis [20], mobility robustness optimization [21], etc. Moreover, our previous study indicates that the DRL can also achieve stateofart load balancing performance under a smallscale SON [22]. These successes envision a bright future to exploit DRL in realizing adaptive and autonomous largescale load balancing under complex UDN environments, e.g., coupled interferences and irregular network topologies. However, this paradigm still remains to be explored.
IA Contributions
In this paper, we investigate the largescale MLB problem in UDNs with emphasis on three important features, i.e., generalization, adaptation, and scalability. The main contributions are summarized as follows.

•
Architecture: This work proposes a twolayer MLB architecture to handle the largescale load balancing problem for UDNs in a selforganized manner. The top layer aims at dynamically grouping the underlying SBSs into selforganized clusters with adaptation to global traffic variations by using a tailored kmeans algorithm, which picks the top overloaded SBSs as the initial cluster centroids and then groups SBSs based on their locations; the bottom layer aims at balancing the intracluster load distribution with adaptation to local traffic variations by using a DRLbased algorithm. Such a selforganized mechanism breaks the largescale load balancing problem into smaller pieces that are easier to be handled distributively, thereby improving the system scalability and costefficiency.

•
Algorithm: First, this work proposes an offpolicy DRLbased algorithm for MLB, which approximates the MLB policy with deep neural networks to largely improve the model generalization capability over complex UDN environments. The algorithm can be trained under an asynchronous parallel learning framework to autonomously learn the optimal MLB policy without any prior knowledge over the underlying wireless environments. Second, the proposed algorithm enables joint exploration with multiple behavior policies, such that existing MLB methods could be used to guide the learning process thereby improving the learning efficiency and stability at the early stage. Empirical results verify that the proposed DRLbased MLB model can adapt to general UDN environments featured with irregular network deployment, coupled interferences, and random user mobility patterns, and outperforms existing MLB methods considerably.

•
Mechanism: This work proposes an offlineevaluation based safeguard mechanism to improve the online control performance in practical systems. Specifically, the proposed mechanism runs an online branch for system control and an offline branch for policy learning in parallel. The online branch always uses the best MLB policy welltrained by the offline branch. As such, the mechanism can avoid the risk of exhibiting unstable or even destructive behaviors when following the undertrained policy at the early learning stage. Moreover, such a mechanism also enables policy exploration in a safe way, which is of vital importance to make full use of machine learning to go beyond existing methods for superior alternatives.
Indeed, the proposed architecture, algorithm and mechanism form a general autonomous and intelligent network control framework, which is also promising to solve other largescale network control problems in the future systems by changing the learning context.
The remainder of this paper is organized as follows. Section II introduces the twolayer MLB architecture and the system model. Section III presents the centralized clustering algorithm. Section IV presents the selforganized DRLbased MLB algorithm. Section V presents the safeguard mechanism for online control. Section VI discusses the simulation results. Finally, Section VII concludes the paper.
II System Model
IIA Twolayer MLB Architecture for Load Balancing in UDNs
In this paper, we propose a twolayer MLB architecture to control largescale load balancing for UDNs in a selforganized manner. As shown in Fig. 1, the twolayer architecture includes a centralized loaddriven clustering controller at the top layer, and multiple selforganized load balancing controllers at the bottom layer. The top layer aims at grouping all the SBSs into a bunch of clusters according to their historical load levels, repeatedly performed on the order of hours; the bottom layer aims at balancing the load within each cluster, repeatedly performed on the order of seconds up to minutes. In this way, the top layer adapts to the dynamic global traffic fluctuations from a macroscopic view, while the bottom layer tunes the intracluster load distribution at a finer level.
The main advantages of the proposed architecture can be summarized as follows.

1.
Scalability: The largescale UDN can now be managed by multiple selforganized controllers, where each of them only needs to work with a small portion of the entire network, i.e., one cluster. As such, each controller only needs to adapt their configurations autonomously based on partial information of the network, which largely reduces the global information exchanges and is more robust to the network topology variation, thus having a strong scalability.

2.
Efficiency: The scale of the nonconvex load balancing problem increases with the the number of SBSs, which implies an increasing amount of training overhead when using learningbased algorithms. Meanwhile, the global minimum is not easy to obtain. The proposed selforganized paradigm breaks the large problem into smaller pieces that are easier to handle, such that the high computational complexity can be distributed to a bunch of selforganized controllers, which can work in parallel. In this way, the algorithm is expected to find better locally optimal solutions with improved costefficiency.
In this paper, we consider a MLB scenario where the SBSs are densely and randomly deployed. The overloaded SBSs handover the mobile users to their corresponding neighboring SBSs for traffic offloading by tuning the CIOs. The distributed SBSs collect the environment data, e.g., SBS loads, and deliver the data to the controllers at the bottom layer. The top layer dynamically groups the SBS into a bunch of clusters by using a loaddriven clustering algorithm; the bottom layer performs selforganized intracluster load balancing based on partial network information by using a DRLbased learning algorithm. The detailed workflow is shown in Fig. 2 and can be interpreted as follows. The clustering at the top layer is triggered either periodically or based on certain traffic fluctuations, e.g., the sudden increase of traffic due to some sport events, the daily tidal effect, etc. We refer to the time period between two clustering operations at the top layer as one MLB stage.
First, at the beginning of MLB stage $k$, the top layer groups all the SBSs into multiple clusters according to their averaged load levels during the previous stage. The clustering result stays unchanged over the stage. Then, the bottom layer performs intracluster MLB for each cluster by executing: 1) action selection, i.e., selecting the MLB actions according to the DRLbased MLB policy; 2) system control, i.e., executing the control actions for intracluster load offloading; 3) policy improvement, i.e., improving its DRLbased MLB policy by learning. The stage $k$ repeats the above three procedures until the next stage $k+1$, i.e., until the reclustering is triggered.
IIB Handover Management
IIB1 Handover
The handover process in MLB aims at transferring a mobile user from its serving SBS to a neighboring SBS if better signal quality can be reached. Specifically, at a given time $t$, the handover of one mobile user from its serving SBS $i$ to a neighboring SBS $j$ is triggered according to the A3 condition [23] as
$${F}_{j}^{t}{F}_{i}^{t}>{O}_{ij}^{t}+Hys,$$  (1) 
where ${F}_{i}^{t}$ and ${F}_{j}^{t}$ are the user’s reference signal received power (RSRP) of its serving SBS $i$ and the neighboring SBS $j$, respectively, $Hys$ is the handover hysteresis, which is usually a fixed value to prevent frequent handovers, and ${O}_{ij}^{t}$ is the CIO between SBS $i$ and SBS $j$. Note that ${O}_{ij}^{t}$ is usually symmetrical, i.e., ${O}_{ij}^{t}={O}_{ji}^{t}$, to ensure that a user handed over from one SBS to another will not be handed straight back to prevent the pingpong effect [10].
According to (1), decreasing a proper value of ${O}_{ij}^{t}$ can trigger user handovers from SBS $i$ to SBS $j$, thereby offloading the load from SBS $i$ to SBS $j$, and vise versa. Therefore, the key of MLB is to find the best CIO tuning policy to trigger user handovers optimally.
IIB2 User Throughput
The signaltonoiseplusinterference ratio (SINR) of user $u$ in SBS $i$ at time $t$ is given by
$${\text{SINR}}_{u}^{t}=\frac{{P}_{i}{G}_{ui}^{t}}{{N}_{0}+{\sum}_{j\in \mathcal{I},j\ne i}{P}_{j}{G}_{uj}^{t}},$$  (2) 
where $\mathcal{I}$ is the SBS set, ${P}_{i}$ is the transmission power of SBS $i$, ${G}_{ui}^{t}$ is the channel power gain between user $u$ and SBS $i$ at time $t$, and ${N}_{0}$ is the noise power, which is assumed, without loss of generality, to be the same for all the users. We assume that the smallest resource unit to allocate is the physical resource block (PRB). For a given user $u$, the maximum transmission rate of one PRB at time $t$ is given by
$${R}_{u}^{t}=B{\mathrm{log}}_{2}(1+{\text{SINR}}_{u}^{t}),$$  (3) 
where $B$ is the spectrum bandwidth of one PRB.
IIB3 Load
In this paper, we assume that each user has a constant bit rate (CBR) requirement ${M}_{u}^{t}$ at time $t$. The required number of PRBs to meet the demand ${M}_{u}^{t}$ is
$${N}_{u}^{t}=\mathrm{min}\{\frac{{M}_{u}^{t}}{{R}_{u}^{t}},{N}_{c}\},$$  (4) 
where ${N}_{c}$ is a constant threshold to keep the number of PRBs occupied by users with poor channel quality to be under a reasonable level. Finally, the load of SBS $i$, defined as the ratio of users’ required number of PRBs versus the total number of PRBs, can be written as
$${\rho}_{i}^{t}=\frac{{\sum}_{u\in {\mathcal{U}}_{i}^{t}}{N}_{u}^{t}}{{N}_{i}^{p}},$$  (5) 
where ${N}_{i}^{P}$ is the total number of PRBs of SBS $i$, and ${\mathcal{U}}_{i}^{t}$ is the set of users assigned to SBS $i$ at time $t$.
IIC Preliminaries on Reinforcement Learning
IIC1 Reinforcement Learning
In this paper, we formulate the load balancing problem with RL. Specifically, RL aims at maximizing a cumulative reward by selecting a sequence of optimal actions under different system states in a stochastic unknown environment [24]. The dynamics is usually modeled as a Markov decision process (MDP), which can be characterized by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward function $r:\mathcal{S}\times \mathcal{A}\to {\mathbb{R}}^{1}$, and a stationary transition probability satisfying the Markov property $p({\bm{s}}_{t+1}{\bm{s}}_{1},{\bm{a}}_{1},\mathrm{\cdots},{\bm{s}}_{t},{\bm{a}}_{t})=p({\bm{s}}_{t+1}{\bm{s}}_{t},{\bm{a}}_{t})$, where $\bm{s}\in \mathcal{S},\bm{a}\in \mathcal{A}$. Specifically, at each state ${\bm{s}}_{t}\in \mathcal{S}$, the RL agent selects an action ${\bm{a}}_{t}\in \mathcal{A}$ by following a policy $\pi $ to interact with the environment, and receives a reward $r({\bm{s}}_{t},{\bm{a}}_{t})$; then the state ${s}_{t}$ moves on to ${\bm{s}}_{t+1}$ and the system dynamics starts the next round.
IIC2 Value Function
In RL, the value function for a given policy $\pi $ at state $\bm{s}$ is defined to be the received longterm expected cumulative rewards starting at state $\bm{s}$ and following the policy $\pi $ thereafter [24], i.e.,
$${V}^{\pi}(\bm{s})={\mathbb{E}}^{\pi}\left[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}r({\bm{s}}_{t},{\bm{a}}_{t})\right{\bm{s}}_{t}=\bm{s}],$$  (6) 
where $\gamma \in [0,1]$ is the discount factor. Similarly, the Qfunction for a given policy $\pi $ when choosing action $\bm{a}$ at state $\bm{s}$ is defined as
$${Q}^{\pi}(\bm{s},\bm{a})={\mathbb{E}}^{\pi}\left[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}r({\bm{s}}_{t},{\bm{a}}_{t})\right{\bm{s}}_{t}=\bm{s},{\bm{a}}_{t}=\bm{a}].$$  (7) 
Generally, the goal of RL is to find the optimal policy ${\pi}^{*}$ which maximizes the cumulative discounted rewards from the start state ${\bm{s}}_{0}$ [24].
IIC3 Offpolicy Learning
The learning and exploration in RL usually involve two policies: the target policy and the behavior policy. The target policy is optimized by using the samples generated by following the behavior policy. In practice, the target policy is usually used for system control, while the behavior policy is used for exploration. In the literature, there are two classes of learning methods in RL [24], i.e., the onpolicy methods and the offpolicy methods. The onpolicy methods use the same policy as the target policy and the behavior policy, in other words, using the same policy for both system control and exploration. The offpolicy methods, on the other hand, separate the target policy from the behavior policy, where the target policy is optimized by using the samples generated from the behavior policy based on the importance sampling technique [24].
In this paper, we focus on the offpolicy RL for MLB with the following motivations. First, it would be problematic or even dangerous to apply the onpolicy methods to online systems, since it requires online exploration, i.e., performing random actions for system control in order to generate randomized learning samples. Comparatively, with offpolicy learning, we can employ a deterministic policy for online control while following a different stochastic behavior policy to explore learning samples offline, e.g., by using a network evaluation system, thus avoiding the risk of online exploration. Second, the samples generated from different behavior policies or in previous explorations can be reused in the offpolicy RL, which is the foundation for our later proposed learning scheme under multiple behavior policies.
IID Problem Formulation
The clustering at the top layer partitions the largescale load balancing problem into multiple small intracluster load balancing problems, where each load balancing controller at the bottom layer only needs to tune the CIOs of its corresponding intracluster SBSs.
In this paper, we model the MLB problem in UDN as a MDP and exploit the offpolicy RL to learn the optimal intracluster CIO tuning policy. Specifically, the SBS load distribution and user distribution are defined as the state, the change of CIO values is defined as the action, and the load balancing utility is defined as the reward. At each time $t$, the bottom layer selects the MLB action ${\bm{a}}_{t}$ according to the observed UDN state ${\bm{s}}_{t}$, which shifts the system to the next state ${\bm{s}}_{t+1}$ according to the state transition probability $p({\bm{s}}_{t+1}{\bm{s}}_{t},{\bm{a}}_{t})$. Note that the transition probability is conditioned on both the current UDN state ${\bm{s}}_{t}$ and the MLB action ${\bm{a}}_{t}$, which indicates that the MLB actions affect not only the immediate MLB reward but also the next UDN state and, through that, all subsequent MLB rewards. Compared with the traditional formulations which usually aim at maximizing the load balancing utility at a specific time slot, the RLbased formulation is more farsighted, aiming at achieving a balanced load distribution over a long time horizon. Moreover, the RLbased formulation learns from the incrementally generated data samples when interacting with the UDN environment for MLB, thereby bearing a builtin datadriven learning nature. The detailed learning context is defined as follows.
IID1 States
The inputs of the RLbased learning algorithm are the wireless system status. However, directly using the raw system data as the RL inputs is problematic: first, the state space could be too large to enumerate; second, the highdimensional raw data is computationally challenging to process and time costing to transmit; third, the raw data may contain too much redundancy to negatively impact the learning performance. Therefore, in this paper, we propose to use a collection of highlevel features to represent the wireless system status as the RL inputs. Specifically, the state for each SBS $i\in \mathcal{I}$ is composed of: i) the SBS load derived from the averaged load ${\stackrel{~}{\rho}}_{i}={\rho}_{i}{\rho}_{g}$, where ${\rho}_{g}=\frac{1}{N}{\sum}_{i=1}^{N}{\rho}_{i}$ with $N$ denoting the number of SBS; ii) the fraction of the edge users ${E}_{i}$. Here we categorize the edge users according to their downlink SINRs to the corresponding serving SBS and the neighboring SBSs. Generally, one SBS with more edge users indicates that its handover is more sensitive to the change of CIOs. In summary, the state vector could be written as
$${\bm{s}}_{t}={[{\stackrel{~}{\rho}}_{1}^{t},{\stackrel{~}{\rho}}_{2}^{t},\mathrm{\cdots},{\stackrel{~}{\rho}}_{N}^{t},{E}_{1}^{t},{E}_{2}^{t},\mathrm{\cdots},{E}_{N}^{t}]}^{\top}.$$  (8) 
It is noteworthy to point out that the feature selection is not unique; other combinations could be explored in the future.
IID2 Actions
The output of the RLbased learning algorithm is the CIO values, i.e.,
$${\bm{a}}_{t}=\{{O}_{ij}(t)\forall i,j\in \mathcal{I}\}.$$  (9) 
Here we have ${O}_{ij}\in [{O}_{\mathrm{min}},{O}_{\mathrm{max}}],\forall i,j\in \mathcal{I}$, where ${O}_{\mathrm{min}}$ and ${O}_{\mathrm{max}}$ are the predefined lower bound and upper bound, respectively. Note that, in this paper, in order to study a general control problem, we consider the CIO value to be continuous.
IID3 Rewards
The load distribution could be balanced by minimizing the maximum load (or equivalently maximizing the inverse of the maximum load) of all the SBSs [12], i.e., alleviating the worst case. Hence, we define the reward function to be
$$r({\bm{s}}_{t},{\bm{a}}_{t})=\frac{1}{{\mathrm{max}}_{i\in \mathcal{I}}{\rho}_{i}^{t}}.$$  (10) 
IID4 Policy
The policy in RL is usually modeled as a stochastic function $\pi :\mathcal{S}\to \mathcal{P}(\mathcal{A})$, which characterizes the probability of selecting an action ${a}_{t}\in \mathcal{A}$ at an arbitrary state ${s}_{t}\in \mathcal{S}$. However, in this paper, we consider a deterministic policy for CIO tuning in practice, which can be parameterized as
$${\pi}_{\theta}:\mathcal{S}\to \mathcal{A},$$  (11) 
where $\theta \in {\mathbb{R}}^{n}$ is the parameter to optimize. Particularly, in this paper, the policy is parameterized by the deep neural network to improve the model generalization capability, which is discussed in Sec. IVD.
IID5 Objective
Our goal is to find a CIO tuning policy which can maximize the cumulative discounted MLB reward from the start state ${s}_{0}$. The objective function can be written as
${J}_{\beta}({\pi}_{\theta})$  $={\displaystyle {\int}_{\mathcal{S}}}{\kappa}^{\beta}(s){V}^{{\pi}_{\theta}}(s)\mathit{d}s$  (12a)  
$={\displaystyle {\int}_{\mathcal{S}}}{\kappa}^{\beta}(s){\displaystyle {\int}_{\mathcal{A}}}\pi (s,a){Q}^{{\pi}_{\theta}}(s,a)\mathit{d}a\mathit{d}s$  (12b)  
$={\displaystyle {\int}_{\mathcal{S}}}{\kappa}^{\beta}(s){Q}^{{\pi}_{\theta}}(s,{\pi}_{\theta}(s))\mathit{d}s$  (12c)  
$={\mathbb{E}}_{s\sim {\kappa}^{\beta}}\left[{\displaystyle \sum _{k=0}^{\mathrm{\infty}}}{\gamma}^{k}r(s,{\pi}_{\theta}(s))\right],$  (12d) 
where $\beta :\mathcal{S}\to \mathcal{P}(\mathcal{A})$ is a stochastic behavior policy, and
$${\kappa}^{\beta}(s):={\int}_{\mathcal{S}}\sum _{t=1}^{\mathrm{\infty}}{\gamma}^{t1}{p}_{1}(s)p(s\to {s}^{\prime},t,\beta )ds,$$  (13) 
is the discounted state visitation distribution [25] when following the behavior policy $\beta $ with ${p}_{1}(s)$ the probability of the starting state and $p(s\to {s}^{\prime},t,\beta )$ the transition probability from state $s$ to state ${s}^{\prime}$ under policy $\beta $ at time $t$. Generally, the objective can be viewed as the value function or the Qfunction of the target policy averaged over the state distribution of the behavior policy [25]. Finally, the offpolicy RLbased intracluster MLB problem could be written as
${\mathcal{P}}_{0}:$  $\underset{\theta}{\mathrm{max}}{J}_{\beta}({\pi}_{\theta})$  (14)  
$\mathrm{s}.\mathrm{t}.$  ${C}_{1}:{\mathcal{X}}_{u,i}^{t}\in \{0,1\},{\displaystyle \sum _{i\in \mathcal{I}}}{\mathcal{X}}_{u,i}^{t}\le 1,\forall u\in \mathcal{U},$  (15)  
${C}_{2}:{O}_{ij}\in [{O}_{\mathrm{min}},{O}_{\mathrm{max}}],\forall i,j\in \mathcal{I},$  (16) 
where ${\mathcal{X}}_{u,i}^{t}$ defines the uniqueness of the usertoSBS association, ${C}_{1}$ defines the uniqueness of user association, and ${C}_{2}$ defines the bound over the CIO values.
III Centralized Dynamic LoadDriven Clustering
In our proposed architecture, the top layer aims at dynamically grouping all the SBSs into a bunch of clusters according to the global traffic variations. The clustering can be triggered either periodically or based on certain traffic patterns, e.g., the sudden increase of traffic due to some sport events, the daily tidaleffect, etc. In this paper, we tailor the celebrated Lloyd’s kmeans clustering method [26] to perform loaddriven clustering for the underlying SBSs, which is fast to converge in practice. The proposed clustering algorithm consists of two phases: 1) loadbased initialization, which is based on the detection of traffic hotspots, i.e., the top overloaded SBSs; 2) locationbased SBS clustering, which aims at grouping the hotspot SBSs with their nearby SBSs based on the geographical distance.
III1 Loadbased Initialization
The selection of initial cluster centroids can largely influence the clustering result. A commonly used strategy is to randomly choose a bunch of SBSs as the initial centroids [27]. Based on the domain knowledge, in this paper, we instead propose to choose the top $k$ overloaded SBSs as the initial centroids in order to efficiently offload the traffic from the hotspot SBSs for load balancing. First, we define the stageaveraged load to be the averaged SBS load over the previous MLB stage. Specifically, for each SBS $i\in \mathcal{I}$, we define its stageaveraged load over the previous stage $k1$ as
$${\widehat{\rho}}_{i}^{k1}=\frac{1}{{T}_{k1}}\sum _{t={t}_{k1}^{0}}^{{t}_{k1}^{0}+{T}_{k1}1}{\rho}_{i}^{t},$$  (17) 
where ${t}_{k1}^{0}$ is the start time of stage $k1$, and ${T}_{k1}$ is the time length of stage $k1$. The stageaveraged load can reflect a fair load level by reducing the influence from abnormal traffic variations. Next, we rank the SBSs according to their stageaveraged loads to obtain a sorted list ${\bm{C}}_{\text{list}}$ as the initial cluster centroids.
III2 Locationbased Clustering
Given the list of candidate centroids from the initialization phase, we next group the SBSs according to their geographical locations by minimizing the sum of squared error (SSE):
$\underset{{\mathcal{C}}_{h},{\bm{c}}_{h}}{\mathrm{min}}{\displaystyle \sum _{h=1}^{H}}{\displaystyle \sum _{{\bm{x}}_{i}\in {\mathcal{C}}_{h}}}{\parallel {\bm{x}}_{i}{\bm{c}}_{h}\parallel}_{2}^{2}$  (18)  
$\mathrm{s}.\mathrm{t}.$  ${\mathcal{X}}_{i,h}^{t}\in \{0,1\},{\displaystyle \sum _{h\in \mathscr{H}}}{\mathcal{X}}_{i,h}^{t}\le 1,\forall i\in {\mathcal{C}}_{h},$  (19) 
where $H$ is the predetermined number of clusters, ${\mathcal{C}}_{h}$ is the set of SBSs that are assigned to the cluster $h$, ${\bm{x}}_{i}$ is the location of SBS $i$, ${\mathcal{X}}_{i,h}^{t}$ defines the uniqueness of the SBStocluster association, and
$${\bm{c}}_{h}=\frac{1}{{\mathcal{C}}_{h}}\sum _{{\bm{x}}_{i}\in {\mathcal{C}}_{h}}{\bm{x}}_{i},$$  (20) 
is the centroid location of cluster ${\mathcal{C}}_{k}$ with $\cdot $ the cardinality. The detailed procedure is provided in Algorithm III2.
Note that the selection of $H$ is also critical. The clustering validation methods, e.g., the CalinskiHarabasz index [28], can be employed to evaluate all the possible $H$ and choose the best one. For example, we can first set a searching range for the candidate $H$, e.g., $\mathscr{H}=\{1,2,\mathrm{\cdots},{H}_{\text{max}}\}$ where ${H}_{\text{max}}$ is the upper bound; then, we evaluate the CalinskiHarabasz index of the clustering result with each $h\in \{1,2,\mathrm{\cdots},{H}_{\text{max}}\}$; finally, we choose the one with the highest CalinskiHarabasz index as the optimal choice.
It is noteworthy to remark that our proposed kmeans algorithm is only one possible solution for the dynamic clustering at the top layer; other explorations could be done in the future with different preferences, e.g., by considering user mobility pattern, social relationships, radio propagation, etc. {algorithm} \[email protected]@algorithmic[1] \STATEInput: \STATE $t$; ${\rho}_{i}^{t}$; ${\bm{x}}_{i}$; number of clusters $H$; number of SBSs $N$. \STATEInitialization: \STATE Calculate the stageaveraged loads according to (17). \STATE Rank the stageaveraged loads to obtain ${\bm{C}}_{\text{list}}$. \STATE Initialize the cluster centroids with the top $H$ elements from ${\bm{C}}_{\text{list}}$. \STATEIteration: \REPEAT\FOR$i=1,2,\mathrm{\cdots},N$ \FOR$h=1,2,\mathrm{\cdots},H$ \STATE Update the SBStocluster association by
$${\mathcal{X}}_{i,h}^{t}=\mathrm{arg}\mathrm{min}{\parallel {\bm{x}}_{i}{\bm{c}}_{h}\parallel}_{2}^{2}$$ 
$h=1,2,\mathrm{\cdots},H$ \STATEUpdate the cluster centroids by
$${\bm{c}}_{h}=\frac{{\sum}_{i=1}^{N}{\mathcal{X}}_{i,h}^{t}{\bm{x}}_{i}}{{\sum}_{i=1}^{N}{\mathcal{X}}_{i,h}^{t}}$$ 
No change of the cluster centroids.
IV SelfOrganized DRLBased Load Balancing
The bottom layer aims at balancing the load within each cluster with the proposed offpolicy DRLbased MLB algorithm. Generally, the classic offpolicy method optimizes the target policy by following a single behavior policy for exploration. However, it is typical to assume that the behavior policy must allow the RL agent to explore every possible state infinitely often [24]. To this end, the commonly adopted behavior policy should be a fully randomized one, such that the explored highquality samples are usually very sparse, which makes the learning inefficient. Therefore, in this paper, we propose to adopt multiple behavior policies for joint exploration, which covers the use of a single behavior policy as a special case, and has the following advantages.

1.
Stability: The learning now only requires the jointly explored space of multiple behavior policies to cover the entire statespace. Hence, the learning over multiple behavior policies can perform well even though none of the behavior policies can lead to the optimal solution on its own.

2.
Efficiency: The explored samples by following different behavior policies are richer than simply following one behavior policy, which implies a higher probability to find highquality samples. Moreover, the behavior policies could include traditional MLB methods to directly generate highquality samples to realize expertguided or modelassisted learning, which could improve the learning efficiency.
Note that the learning under multiple behavior policies can be implemented for practical systems in many ways. For example, 1) the selforganized cluster could use different behavior policies in turn, e.g., using traditional MLB methods to guide learning at the beginning and switching to random exploration later; 2) different selforganized clusters under similar MDPs, e.g., due to similar cell layouts and user mobility patterns, can learn jointly, where each of them follows a different behavior policy and updates a shared target policy.^{1}^{1} 1 In this case, the learning under multiple behavior policies can be considered as a simple multiagent learning paradigm.; 3) the target policy can be trained offline, e.g., in a simulation platform, by following multiple behavior policies, and then be used for online control.
In the sequel, we first extend the policy gradient theorem for the case with multiple behavior policies, which is the key for policy improvement; then we present a parallel learning framework. Finally, we empower the learning framework with deep neural networks to largely improve its generalization capability in complex UDN environments.
IVA Policy Gradient with Multiple Behavior Policies
IVA1 Preliminaries on Policy Gradient
Traditional policy improvement methods find a better policy by greedily choosing the best action under all possible states according to the estimated value function or Qfunction [24]. However, such a greedy searching strategy becomes computationally demanding when the action space is large, and usually impossible for a continuous action space. Therefore, in this paper, we exploit the policy gradient method for policy improvement, which overcomes the limitations of the greedy searching strategy by explicitly optimizing a parameterized policy. Specifically, the policy gradient theorem [24] gives the analytic expression for the gradient of the objective $J({\pi}_{\bm{\theta}})$ with respect to the policy parameters $\bm{\theta}$, written as
$${\nabla}_{\bm{\theta}}J({\pi}_{\bm{\theta}})={\mathbb{E}}_{\bm{s}\sim {d}^{{\pi}_{\bm{\theta}}},\bm{a}\sim {\pi}_{\bm{\theta}}}[{\nabla}_{\bm{\theta}}\mathrm{log}{\pi}_{\bm{\theta}}(\bm{a}\bm{s}){Q}^{{\pi}_{\bm{\theta}}}(\bm{s},\bm{a})],$$  (21) 
where ${d}^{{\pi}_{\bm{\theta}}}$ is the state distribution when following policy ${\pi}_{\bm{\theta}}$. Later, Silver et al. proposed the offpolicy deterministic policy gradient (OPDPG) theorem [25] to optimize a deterministic target policy by following a single stochastic behavior policy, written as
${\nabla}_{\theta}{J}_{\beta}({\pi}_{\theta})$  $\approx {{\displaystyle {\int}_{\mathcal{S}}}{\rho}^{\beta}(s){\nabla}_{\theta}{\pi}_{\theta}(s){\nabla}_{a}{Q}^{\pi}(s,a)}_{a={\pi}_{\theta}(s)}\text{d}s$  (22a)  
$={\mathbb{E}}_{s\sim {\rho}^{\beta}}[{{\nabla}_{\theta}{\pi}_{\theta}(s){\nabla}_{a}{Q}^{\pi}(s,a)}_{a={\pi}_{\theta}(s)}],$  (22b) 
where the approximation comes from the drop of one term depending on the Qvalue gradient ${\nabla}_{\theta}{Q}^{\pi}(s,a)$ to preserve the convergence [25].
IVB OPDPG with Multiple Behavior Policies
The OPDPG theorem can be adopted to optimize our objective function with a proper extension on the learning over multiple behavior policies. Specifically, we denote the set of multiple behavior policies to follow as $\mathcal{M}=\{{\beta}_{1},{\beta}_{2},\mathrm{\cdots},{\beta}_{M}\}$, where $M$ is the total number of behavior policies. Based on the objective under a single behavior policy given in (12), the objective under multiple behavior policies could be written as
$$J({\pi}_{\theta})=\sum _{m\in \mathcal{M}}{J}_{{\beta}_{m}}({\pi}_{\theta}),$$  (23) 
where
$${J}_{{\beta}_{m}}({\pi}_{\theta})={\mathbb{E}}_{s\sim {\kappa}^{{\beta}_{m}}}\left[\sum _{k=0}^{\mathrm{\infty}}{\gamma}^{k}r(s,{\pi}_{\theta}(s))\right],$$  (24) 
with ${\kappa}^{{\beta}_{m}}(s):={\int}_{\mathcal{S}}{\sum}_{t=1}^{\mathrm{\infty}}{\gamma}^{t1}{p}_{1}(s)p(s\to {s}^{\prime},t,\beta )ds$. The new objective $J({\pi}_{\theta})$ can be viewed as the value function of the target policy averaged over the states generated by following multiple behavior policies. The intracluster MLB problem now becomes
${\mathcal{P}}_{1}:$  $\underset{\theta}{\mathrm{max}}J({\pi}_{\theta})={\displaystyle \sum _{m\in \mathcal{M}}}{J}_{{\beta}_{m}}({\pi}_{\theta})$  (25)  
$\mathrm{s}.\mathrm{t}.$  ${C}_{1}:{\mathcal{X}}_{u,i}^{t}\in \{0,1\},{\displaystyle \sum _{i\in \mathcal{I}}}{\mathcal{X}}_{u,i}^{t}\le 1,\forall u\in \mathcal{U},$  
${C}_{2}:{O}_{ij}\in [{O}_{\mathrm{min}},{O}_{\mathrm{max}}],\forall i,j\in \mathcal{I}.$ 
Accordingly, the policy gradient for the learning under multiple behavior policies is given as
${\nabla}_{\theta}J({\pi}_{\theta})={\displaystyle \sum _{m\in \mathcal{M}}}{\nabla}_{\theta}{J}_{{\beta}_{m}}({\pi}_{\theta})$  (26a)  
$\approx {{\displaystyle \sum _{m\in \mathcal{M}}}{\displaystyle {\int}_{\mathcal{S}}}{\rho}^{{\beta}_{m}}(s){\nabla}_{\theta}{\pi}_{\theta}(s){\nabla}_{a}{Q}^{\pi}(s,a)}_{a={\pi}_{\theta}(s)}\text{d}s$  (26b)  
$={\displaystyle \sum _{m\in \mathcal{M}}}{\mathbb{E}}_{s\sim {\rho}^{{\beta}_{m}}}[{{\nabla}_{\theta}{\pi}_{\theta}(s){\nabla}_{a}{Q}^{\pi}(s,a)}_{a={\pi}_{\theta}(s)}],$  (26c) 
where in (26b), we apply the OPDPG theorem for each behavior policy, respectively.
IVC Parallel Learning Framework
We now develop a parallel learning framework based on the actorcritic algorithm [24], where each agent follows a different behavior policy for exploration, while updating a shared target policy simultaneously.
IVC1 Preliminaries on ActorCritic
The actorcritic algorithm [24], which comprises an actor and a critic learner, is one of the most popular learning algorithms based on the policy gradient methods. As shown in Fig. 4(a), the critic estimates the true Qfunction ${Q}^{{\pi}_{\bm{\theta}}}(\bm{s},\bm{a})$ of the actor’s policy ${\pi}_{\bm{\theta}}$ with a parameterized function ${Q}^{\bm{w}}(\bm{s},\bm{a})$, where $\bm{w}\in {\mathbb{R}}^{n}$ is the parameters to be learned based on the policy evaluation methods [24]. The actor, on the other hand, improves its policy ${\pi}_{\bm{\theta}}$ with the estimated Qfunction ${Q}^{\bm{w}}(s,a)$ based on the policy gradient methods. In this paper, we use Qlearning for policy evaluation in the critic, while using the extended OPDPG in (26) for the policy gradient in the actor. The critic and the actor will iteratively improve the estimated Qfunction ${Q}^{\bm{w}}(\bm{s},\bm{a})$ and the policy ${\pi}_{\bm{\theta}}$ until convergence.
IVC2 Parallel Learning Framework with Multiple Behavior Policies
As shown in Fig. 4(b), we employ multiple actorcritic RL agents to learn in parallel. Generally, each RL agent interacts with the environment by using a different behavior policy and transmits the calculated gradients to a centralized parameter server. The parameter server uses the local gradients to update a set of global parameters and periodically synchronize with the local parameters for consensus. The above procedures are performed repeatedly until convergence or some termination criteria are met. In particular, the detailed workflow for the iteration at time $t$ is presented as follows.
First, for each agent $m\in \mathcal{M}$ at time $t$, the local actor observes the current state ${s}_{t}$, and executes an action ${a}_{t}^{m}$ by following its own behavior policy with ${a}_{t}^{m}={\beta}_{m}({s}_{t}^{m})$. Afterwards, the current state ${s}_{t}^{m}$ shifts to ${s}_{t+1}^{m}$ and returns a reward ${r}_{t}^{m}$. The generated transition at time $t$ can be written as $({s}_{t}^{m},{a}_{t}^{m},{s}_{t+1}^{m},{r}_{t}^{m})$.
Second, each local critic evaluates the Qfunction of the target policy based on Qlearning [24]. Specifically, the gradient ${\mathrm{\Delta}}_{w}^{m}$ for the Qfunction updates can be written as
${\delta}_{t}^{m}$  $={r}_{t}^{m}+\gamma {Q}^{w}({s}_{t+1}^{m},{\pi}_{\theta}({s}_{t+1}^{m})){Q}^{w}({s}_{t}^{m},{a}_{t}^{m}),$  (27)  
${\mathrm{\Delta}}_{w}^{m}$  $={\delta}_{t}^{m}{\nabla}_{w}{Q}^{w}({s}_{t}^{m},{a}_{t}^{m}),$  (28) 
where ${\delta}_{t}^{m}$ is the temporal difference (TD) error, with ${\pi}_{\theta}({s}_{t+1}^{m})$ the action selected by the target policy ${\pi}_{\theta}$ at state ${s}_{t+1}^{m}$. Note that each local critic only needs to submit the gradient to the parameter server, without applying to the local Qfunction.
Third, each local actor computes the policy gradient according to the extended OPDPG theorem in (26). The gradient ${\mathrm{\Delta}}_{\theta}^{m}$ for the target policy updates can be written as
$${\mathrm{\Delta}}_{\theta}^{m}={{\nabla}_{\theta}{\pi}_{\theta}({s}_{t}^{m}){\nabla}_{a}{Q}^{w}({s}_{t}^{m},a)}_{a={\pi}_{\theta}({s}_{t}^{m})}.$$  (29) 
Similarly, each local actor only needs to submit the gradient to the parameter server, without applying to the local policy.
Finally, the parameter server uses the collected gradients to update a set of global parameters as
${w}_{t+1}^{g}$  $={w}_{t}^{g}+{\displaystyle \sum _{m=1}^{M}}{\alpha}_{w}{\mathrm{\Delta}}_{w}^{m},$  (30)  
${\theta}_{t+1}^{g}$  $={\theta}_{t}^{g}+{\displaystyle \sum _{m=1}^{M}}{\alpha}_{\theta}{\mathrm{\Delta}}_{\theta}^{m},$  (31) 
where ${\alpha}_{w}$ and ${\alpha}_{\theta}$ are the stepsizes. Afterwards, the local parameters ${w}_{t}^{m}$ and ${\theta}_{t}^{m}$ of all the agents $m\in \mathcal{M}$ (including all the actors and critics) are synchronized with the global parameters ${w}_{t+1}^{g}$ and ${\theta}_{t+1}^{g}$ for consensus, which completes this iteration.
IVC3 Convergence Analysis
Suppose that $p({s}^{\prime}s,a)$, ${\nabla}_{a}p({s}^{\prime}s,a)$, ${\mu}_{\theta}$, ${\nabla}_{\theta}{\mu}_{\theta}(s)$, $r(s,a)$, ${\nabla}_{a}r(s,a)$, ${p}_{1}(s)$ are continuous in variables $s,a$ and ${s}^{\prime}$, and that the state space $\mathcal{S}$ is a compact subset of ${\mathbb{R}}^{d}$, the convergence of the proposed parallel actorcritic learning framework can be reached when we use: i) compatible linear Qfunction approximators [25]; ii) the gradient temporaldifference (GTD) based learning method [29] for the critic updates.
The justification follows the one in [25]. First, the use of a compatible linear Qfunction approximator ensures that the deterministic policy gradient will not be affected when using the estimated gradient ${\nabla}_{a}{Q}^{w}(s,a)$ to replace the true gradient ${\nabla}_{a}{Q}^{\mu}(s,a)$. In particular, a function approximator that is compatible with the deterministic policy ${\mu}_{\theta}(s)$ is defined to satisfy [25]:

1.
${{\nabla}_{a}{Q}^{w}(s,a)}_{a={\mu}_{\theta}(s)}={\nabla}_{\theta}{\mu}_{\theta}{(s)}^{\top}w$;

2.
$w$ minimizes the meansquare error $\text{MSE}(\theta ,w)={\mathbb{E}}_{s\sim {\rho}^{{\beta}_{m}}}[\u03f5{(s;\theta ,w)}^{\top}\u03f5(s;\theta ,w)]$, where $\u03f5(s;\theta ,w)={{\nabla}_{a}{Q}^{w}(s,a)}_{a={\mu}_{\theta}(s)}{{\nabla}_{a}{Q}^{\mu}(s,a)}_{a={\mu}_{\theta}(s)}$.
Therefore, we have
$${\nabla}_{w}\u03f5(s;\theta ,w)={{\nabla}_{w}{\nabla}_{a}{Q}^{w}(s,a)}_{a={\mu}_{\theta}(s)}={\nabla}_{\theta}{\mu}_{\theta}(s).$$  (32) 
If $w$ can minimize the MSE, the gradient of MSE w.r.t. $w$ should satisfy
${\nabla}_{w}\text{MSE}(\theta ,w)$  $=2{\mathbb{E}}_{s\sim {\rho}^{{\beta}_{m}}}[{\nabla}_{w}\u03f5(s;\theta ,w)\u03f5(s;\theta ,w)]$  (33a)  
$=2{\mathbb{E}}_{s\sim {\rho}^{{\beta}_{m}}}[{\nabla}_{\theta}{\mu}_{\theta}(s)\u03f5(s;\theta ,w)]=0.$  (33b) 
According to the definition of $\u03f5(s;\theta ,w)$, we have
${\mathbb{E}}_{s\sim {\rho}^{{\beta}_{m}}}[{{\nabla}_{\theta}{\mu}_{\theta}(s){\nabla}_{a}{Q}^{w}(s,a)}_{a={\mu}_{\theta}(s)}]$  (34a)  
$=$  ${\mathbb{E}}_{s\sim {\rho}^{{\beta}_{m}}}[{{\nabla}_{\theta}{\mu}_{\theta}(s){\nabla}_{a}{Q}^{\mu}(s,a)}_{a={\mu}_{\theta}(s)}]$  (34b)  
$=$  ${\nabla}_{\theta}{J}_{{\beta}^{m}}({\mu}_{\theta}),$  (34c) 
which proves that the estimated gradient ${\nabla}_{a}{Q}^{w}(s,a)$ can replace the true gradient ${\nabla}_{a}{Q}^{\mu}(s,a)$ without affecting the deterministic policy gradient. Note that the above proof applies to any of the multiple behavior policies ${\beta}_{m}\in \mathcal{M}$.
Second, the GTD based policy evaluation has been proven to have sure convergence for the case with multiple behavior policies when using true gradients [29], which herein is also applicable when using estimated gradients based on the above proof. Hence, we can use the extended OPDPG theorem in (26) for the policy gradient, and use the diffusion offpolicy GTD in [29] for the policy evaluation to ensure the convergence. Note that, according to [29], the actor and the critic should be updated with a sufficiently small stepsize, in order to minimize the meansquared projected Bellman error (MSPBE).
In order to achieve a better generalization capability, in the following part, we further apply the nonlinear deep neural networks as the function approximators; and use Qlearning [24] for critic updates, instead of the diffusion offpolicy GTD [29], to reduce the training complexity. Consequently, the convergence is no longer theoretically guaranteed, which is a common issue for deep learning based models. But the empirical results show that our proposed parallel actorcritic learning framework with deep neural networks can still converge in practice.
IVD Fueled With Deep Neural Networks
{algorithm}\[email protected]@algorithmic[1] \STATEInput: \STATEfor $\forall m\in \mathcal{M}$, ${\alpha}_{w}^{m}={10}^{3}$, ${\alpha}_{\beta}^{m}={10}^{4}$; $\gamma =0.99$; $\tau =0.001$; $K=64$. \STATEInitialization: \STATERandom initialize the weights $w$ and $\theta $; for $\forall m\in \mathcal{M}$, set ${w}_{m}\leftarrow w,{\theta}_{m}\leftarrow \theta $, ${\widehat{w}}_{m}\leftarrow {w}_{m},{\widehat{\theta}}_{m}\leftarrow {\theta}_{m}$. \STATEIteration: \STATEObserve the start state ${s}_{1}^{m}$. \FOR$t=1,\mathrm{\infty}$ \FOR$m=1,M$ \STATESynchronize ${w}_{m}\leftarrow w,{\theta}_{m}\leftarrow \theta $. \STATEUpdate the target networks with (37) and (38). \STATEObserve ${s}_{t}^{m}$ and execute ${a}_{t}^{m}={\beta}_{m}({a}_{t}{s}_{t})$. \STATEReceive ${r}_{t}^{m}$ and shift to the next state ${s}_{t+1}^{m}$. \STATEStore the transition $({s}_{t}^{m},{a}_{t}^{m},{r}_{t}^{m},{s}_{t+1}^{m})$ in ${\mathcal{D}}_{m}$. \STATERandomly select $K$ transitions from ${\mathcal{D}}_{m}$. \STATECalculate the local ${\mathrm{\Delta}}_{w}^{m}$ for critic with (39). \STATECalculate the local ${\mathrm{\Delta}}_{\theta}^{m}$ for actor with (40). \STATESend ${\mathrm{\Delta}}_{w}^{m}$ and ${\mathrm{\Delta}}_{\theta}^{m}$ to the parameter server. \ENDFOR\STATEUpdate the global $w$ and $\theta $ with (30) and (31). \ENDFOR
Theoretical convergence and samplecomplexity analysis in RL have been mainly developed when using tabular or linear functions to approximate the Qfunctions [24]. The use of nonlinear function approximators, e.g., neural networks, has long been considered to be unstable or even leading to divergence in RL due to the strong correlations among the generated samples [30, 31, 32, 33]. However, recent successes on DRL have proven that using deep neural networks as function approximators is feasible when adopting the following two learning strategies [30]: 1) using minibatch learning along with an experience replay, which can randomize the generated data samples and smooth the changes over data distributions; 2) using guiding Qnetworks, which can reduce the correlation between the learning Qvalue $Q(s,a)$ and the guiding Qvalue $r({s}_{t},{a}_{t})+\gamma Q({s}_{t+1},{a}_{t+1})$, when deriving the gradients. In the sequel, we empower the proposed algorithm with the powerful deep neural networks, which are trained based on the aforementioned two learning strategies.
IVD1 Deep Architecture
We use two deep neural networks to respectively represent the policy function in the actor and the Qfunction in the critic to improve the generalization capability for MLB, which are referred to as the actornet and criticnet, respectively. The highlevel structure is presented in Fig. 3. Specifically, for the actornet, the policy network takes the system states (the highlevel features defined in Sec. IID) as the inputs, and outputs the CIOs. For the criticnet, the Qnetwork takes both the system states and the output CIOs from the policy network as the inputs, and generates a scalar Qvalue as the final output. The neurons are fully connected.
IVD2 Guiding Networks
We use the guiding networks to stabilize the learning process in the actor and the critic. Specifically, we introduce two more deep neural networks with the same structure as the criticnet and actornet, respectively, to be the guiding criticnet ${Q}^{{\widehat{w}}_{m}}(s,a)$ and the guiding actornet ${\pi}_{{\widehat{\theta}}_{m}}(s)$. The aim of the guiding network is to provide a stabilized learning target when updating the critics. In particular, according to the critic update rules in (27) and (28), the loss function to minimize is given by
$$\mathcal{L}({w}_{m})={\mathbb{E}}_{({s}_{i},{a}_{i},{r}_{i},{s}_{i+1})\in {\mathcal{D}}_{m}}\left[{\left({y}_{i}^{m}{Q}^{{w}_{m}}({s}_{i}^{m},{a}_{i}^{m})\right)}^{2}\right],$$  (35) 
where
$${y}_{i}^{m}=r({s}_{i}^{m},{a}_{i}^{m})+\gamma {Q}^{{\widehat{w}}_{m}}({s}_{i+1}^{m},{\pi}_{{\widehat{\theta}}_{m}}({s}_{i+1}^{m})).$$  (36) 
Here ${y}_{i}^{m}$ serves as the learning target for the criticnet. However, different from (27) and (28), the estimated Qvalue in ${y}_{i}^{m}$ is now generated by the guiding criticnet, with the action predicted by the guiding actornet, i.e., ${Q}^{{\widehat{w}}_{m}}({s}_{i+1}^{m},{\pi}_{{\widehat{\theta}}_{m}}({s}_{i+1}^{m}))$. The parameters of the guiding networks should slowly track the parameters of the learning networks, such that ${y}_{i}^{m}$ becomes a stabilized learning target for the criticnet. In this way, the training of the criticnet is closer to a supervised learning problem, where a robust solution likely exists. In practice, the guiding parameters ${\widehat{w}}_{m}$ and ${\widehat{\theta}}_{m}$ can be updated as
${\widehat{w}}_{m}^{t}$  $=\tau {w}_{m}^{t}+(1\tau ){\widehat{w}}_{m}^{t},$  (37)  
${\widehat{\theta}}_{m}^{t}$  $=\tau {\theta}_{m}^{t}+(1\tau ){\widehat{\theta}}_{m}^{t},$  (38) 
where $\tau \ll 1$ is a fixed stepsize.
IVD3 Minibatch Learning
We use the minibatch gradient descent to train the deep neural networks, which is a widely used technique in deep learning. Generally, in minibatch learning, the training dataset are first split into multiple minibatches of certain size. The gradients over one minibatch are summed or averaged to reduce the variance of the gradient. In this way, the minibatch gradient descent is promising to find a balance between the efficiency of stochastic gradient descent and the robustness of fullbatch gradient descent. In our model, each local agent $m\in \mathcal{M}$ has a local experience replay ${\mathcal{D}}^{m}$ to store the generated transitions $({s}_{t}^{m},{a}_{t}^{m},{r}_{t}^{m},{s}_{t+1}^{m})$ incrementally. At each time $t$, for each local agent $m\in \mathcal{M}$, we select $K$ transitions randomly from its local experience replay to form one minibatch. The gradients to minimize the loss function $\mathcal{L}({w}_{m})$ are averaged as
$${\mathrm{\Delta}}_{w}^{m}=\frac{1}{K}\sum _{i=1}^{K}\left[\left({y}_{i}^{m}{Q}^{{w}_{m}}({s}_{i}^{m},{a}_{i}^{m})\right){\nabla}_{{w}_{m}}{Q}^{{w}_{m}}({s}_{i}^{m},{a}_{i}^{m})\right].$$  (39) 
Similarly, the gradients to update the target policy are averaged as
$${\mathrm{\Delta}}_{\theta}^{m}={\frac{1}{K}\sum _{i=1}^{K}{\alpha}_{\theta}{\nabla}_{{\theta}_{m}}{\pi}_{{\theta}_{m}}({s}_{i}^{m}){\nabla}_{a}{Q}^{{w}_{m}}({s}_{i}^{m},a)}_{a={\pi}_{\theta}({s}_{i}^{m})}.$$  (40) 
IVD4 Asynchronous Update
Based on the proposed parallel actorcritic learning framework, we use a parameter server to collect the minibatch gradients ${\mathrm{\Delta}}_{w}^{m}$ and ${\mathrm{\Delta}}_{\theta}^{m}$ from the local agents to update a set of global parameters, and synchronize the local $w$ and $\theta $ of each agent with the global one for consensus. Note that the training of the deep neural networks can be performed in an asynchronous manner [32, 33], which relaxes the coordination requirements. However, the received gradients exceeding a preset maximum time delay should be filtered out to ensure the learning stability [32, 33].
The complete procedure of our proposed DRLbased intracluster MLB is presented in Algorithm IVD.
V OfflineEvaluation Based Online Control
One of the major drawbacks of using the datadriven machine learning based method for online control is that the system needs to take the risk of applying undertrained policies for control at the early learning stage, which is unstable and even dangerous to the network, e.g., following an undertrained MLB policy may lead to heavily overloaded situations due to random handovers. On the other hand, training the neural networks to readapt to the underlying UDN environments after each reclustering process requires a certain amount of time, which may disturb the online MLB performance. Therefore, in this section, we propose an offlineevaluation based safeguard mechanism which runs an online branch for system control and an offline branch for policy learning in parallel. The online branch always uses the best MLB policy found and welltrained by the offline branch. As such, the mechanism not only stabilizes the online performance, but also enables the exploration beyondcurrent or even over unexpected policies in a safe way. The philosophy behind is of vital importance to make full use of datadriven machine learning, which goes beyond existing methods to deliver superior alternatives.
Specifically, we run the proposed MLB algorithm over the offline learning branch and the online controlling branch in parallel. The learning branch runs in an offline manner, e.g., by using a network evaluation system based on historical performance records; the controlling branch runs in the online system, which only uses the best MLB policy found by the offline branch. For example, as shown in Fig. (5), the workflow at stage $k$ can be interpreted as follows. First, at the beginning of stage $k$, the offline learning branch triggers the reclustering process, and trains the neural networks to adapt to the new clustering result. Then, at the end of stage $k$, the safeguard evaluates the performance of the newly learned MLB policy over the previous stage $k$. If the newly learned policy can generate a better performance, the safeguard would replace the online policy with the new policy; otherwise, it keeps the current online policy in the online branch and continues to seek a better one in the offline branch at the next stage. In this way, we can keep exploring better MLB policies in the offline branch, while only executing the best and welltrained policy online, thus ensuring the online system to operate with a stable MLB policy.
VI Simulation Results
In this section, we evaluate the performance of the proposed twolayer MLB architecture along with the DRLbased MLB algorithm through simulations. We choose four baseline methods: 1) the static rulebased algorithm [10], which uses a fixed stepsize to tune the CIOs; 2) the adaptive rulebased algorithm [11], which uses an adaptive stepsize to tune CIOs; 3) the Qlearning based algorithm [17], which uses Qlearning to tune the CIOs between the most overloaded SBS and its neighboring SBSs; 4) a plain baseline without any MLB controls. We refer to them as the rulebased (static) algorithm, the rulebased (adaptive) algorithm, the Qlearning algorithm, and the noMLB algorithm, respectively.
The simulated SON scenario consists of $12$ SBSs randomly distributed in a $300m\times 300m$ area with $200$ users randomly walking at $1m/s\sim 10m/s$, each incurring a CBR traffic demand. The transmit power of each SBS is set to be $46dBm$. The path loss from each particular user to one particular SBS is modeled as $128.1+37.6\mathrm{log}(\mathrm{max}\{d,0.035\})$, where $d$ is the distance in $km$. The extra lognormal shadowing is modeled with a zero mean and a standard deviation of $8dB$. The handover hysteresis is set to be $3dB$. Both of the actornet and criticnet are composed with two hidden layers, each with $400$ and $300$ neurons, respectively. All hidden layers are followed by a rectified nonlinearity layer. The learning rate for actor and critic networks are fixed as ${10}^{4}$ and ${10}^{3}$, respectively. The discount factor $\gamma $ for rewards is set to be $0.99$. The $\tau $ value for the soft target network update is set to be $0.001$. The size of the minibatch is set to be $64$. The size of each local experience replay is set to be ${10}^{5}$. The simulations are performed on a workstation with eight Intel Xeon E3 CPU cores at 3.50 GHz and NVIDA GeForce GTX 970 with 4G graphics memory, Generally, the training of one DRL model with $10,000$ time steps can usually be completed within 30 minutes.
We consider three MLB performance indicators: 1) the reward function value; 2) the load standard deviation; 3) the handover failure ratio (HFR). Specifically, the HFR is obtained by introducing an admission control mechanism [10, 11] to block the incoming handover attempts if the SBS load exceeds 80%, concretely,
$$F=\frac{{N}_{\text{HOfail}}}{{N}_{\text{HOfail}}+{N}_{\text{HOsuccess}}},$$  (41) 
where ${N}_{\text{HOfail}}$ is the number of blocked handover attempts of all the SBSs and ${N}_{\text{HOsuccess}}$ is the number of successful handover attempts of all the SBSs. Moreover, the presented rewards are moving averaged to smooth out shortterm fluctuations and highlight longerterm trends, thus presenting a clear and sharp comparison. Specifically, the averaged reward at time $t$ is given by
$$\overline{r}(t)=\frac{1}{{T}_{A}}\sum _{i=t{T}_{A}+1}^{t}r(i),$$  (42) 
where ${T}_{A}$ is a fixed subset size, set to be $200$ in this paper.
The result in Fig. 6 presents the scalability of the comparing schemes under the CBR of $96kbps$. The result is averaged over $10,000$ time steps under $30$ different randomized SBS layouts to present a fair and general comparison. Fig. 6(a) presents the performance gain by using the noMLB method as the reward baseline. The result shows that that when performing MLB among only three SBSs, the centralized architecture generates a better performance due to global optimization. However, when performing MLB among more than three SBSs, the twolayer architecture outperforms the centralized architecture with a growing superiority. This verifies the scalability of our proposed twolayer architecture, and validates that the selforganized MLB is promising to find a better (local optimal) solution than the fully centralized MLB by breaking the large nonconvex MLB problem into smaller pieces that are easier to handle, as discussed in Sec. IIA. Meanwhile, the rulebased (static) and the rulebased (adaptive) algorithms are also scalable since they are both fully distributed algorithms. However, their performance is worse than our proposed algorithm due to the lack of cooperations. Fig. 6(b) presents the detailed averaged rewards obtained with six cells, which reveals that compared with the centralized MLB, the performance of the selforganized MLB is more stable and can converge remarkably faster at the early stage.
The result in Fig. 7 shows the moving averaged rewards of all the comparing schemes over $4,000$ time steps under a CBR requirement of $112$ kbps. The performance is averaged over $30$ different and random SBS topologies. The result shows that the proposed DRLbased algorithms can outperform the competitors considerably. Specifically, for our proposed DRLbased algorithm, we study two alternatives: 1) using a single behavior policy that is constructed by adding certain random noises to the target policy; 2) using three behavior policies, including the above noisy policy and two rulebased policies. We refer to them as the DRLSBP algorithm and the DRLMBP algorithm, respectively. The DRLSBP algorithm is trained centrally, while the DRLMBP algorithm is trained in parallel. The noMLB baseline in Fig. 7 indicates that the maximum SBS load fluctuates around 74% if no MLB attempts are made. For the rulebased algorithms, both of the rulebased (static) and rulebased (adaptive) algorithms can control the maximum SBS load to be under 67%. However, the rulebased (adaptive) algorithm can perform slightly better than the rulebased (static) one. For the Qlearning algorithm, due to the quantization of states and actions, the learned policy can hardly be applied to general UDN environments, which results in the same performance as the noMLB baseline. For our proposed DRLbased algorithm, both of the DRLSBP algorithm and the DRLMBP algorithm can control the maximum SBS load to be under 55%, where the DRLMBP algorithm can converge remarkably faster at the early stage and performs slightly better after the convergence.
The result in Fig. 8 shows the rewards obtained by using the proposed safeguard mechanism. The simulation is performed over $50,000$ time steps, where the reclustering is simply triggered with a period of $10,000$ time steps, i.e., about $3$ hours. Fig. 8(a) shows the performance achieved without using the safeguard mechanism. It is clear that the reclustering operation has a strong impact over the MLB performance. Fig. 8(b) shows the performance achieved by using the safeguard mechanism, which can be interpreted as follows. First, at the beginning of the first stage, the offline branch and online branch are initialized with the same parameters and start to learn. Next, at the beginning of the second stage, the offline branch triggers reclustering, and starts to explore a new MLB policy. However, the new policy is worse than the online policy, which is therefore discarded. Then, at the beginning of the third stage, the offline branch starts a new exploration and successfully finds a new policy that is better than the the online policy. Hence, the safeguard replace the online policy with the new policy along with the welltrained parameters at the beginning of the forth stage. In this way, the MLB performance of the online branch is ensured to be stable and nondecreasing as presented in Fig. 8(b).
The result in Fig. 9 reflects the global load distribution over the entire area. Specifically, Fig. 9(a) presents the HFR under different load burdens by varying the CBR from $48kbps$ to $112kbps$. Generally, a lower HFR means that the SBSs are less likely to be overloaded. The result in Fig. 9(a) shows that the HFO gradually increases with the increase of offered loads for all schemes. However, the proposed DRLbased algorithm can always achieve a lower HFR compared with the other methods. The result in Fig. 9(b) shows the load standard deviations under different load burdens. Generally, a lower load standard deviation means that the loads are more equally distributed among SBSs. The result in Fig. 9(b) shows that the proposed DRLbased algorithms can achieve low standard deviations as well. Additionally, the superiority of our proposed algorithms increases along with the load burdens. The HFR and load standard deviation results jointly verify that our proposed algorithm can achieve a more balanced global load distribution than the competitors.
VII Conclusion
In this paper, we proposed a DRLbased MLB algorithm along with a twolayer MLB architecture to solve the largescale load balancing issue in UDNs. The proposed twolayer architecture can handle the largescale MLB problem of UDNs in a selforganized manner, which makes it scalable to the number of SBSs. The proposed offpolicy DRLbased algorithm can be trained via an asynchronous parallel learning framework and employ multiple behavior policies for joint exploration to improve the learning efficiency. Moreover, we proposed an offpolicy evaluation based safeguard mechanism to improve the online control performance by ensuring that the online UDN system always operate with the optimal and welltrained MLB policy. Simulation results verified that 1) the proposed algorithm could outperform existing algorithms considerably, in terms of the MLB performance; 2) the proposed twolayer architecture achieves nice scalability over the number of SBSs, and outperforms the centralized architecture when dealing with a large number of SBSs; 3) the learning under multiple behavior policies converges remarkably faster than the learning under a single behavior policy; 4) the safeguard mechanism ensures the online performace to be stable and nondecreasing. Moreover, the proposed architecture, algorithm and mechanism are also promising to be applied over other largescale network control problems in the future networks. For example, for the access control problem in the IoT system studied in [34], one could first cluster the IoT devices into different groups and then determine their access control policies respectively. Meanwhile, the strategy of learning under multiple behavior policies and the safeguard mechanism can be exploited to further improve the learning efficiency and the online performance. More interesting applications can be explored in the future.
References
 [1] C. Pan, M. Elkashlan, J. Wang, J. Yuan, and L. Hanzo, “Usercentric CRAN architecture for ultradense 5G networks: Challenges and methodologies,” IEEE Commun. Mag., vol. 56, no. 6, pp. 14–20, June 2018.
 [2] Y. Li, Y. Zhang, K. Luo, T. Jiang, Z. Li, and W. Peng, “Ultradense hetnets meet big data: Green frameworks, techniques, and approaches,” IEEE Commun. Mag., vol. 56, no. 6, pp. 56–63, June 2018.
 [3] Y. Zhong, X. Ge, H. H. Yang, T. Han, and Q. Li, “Traffic matching in 5G ultradense networks,” IEEE Commun. Mag., vol. 56, no. 8, pp. 100–105, August 2018.
 [4] H. Zhang, L. Song, and Y. J. Zhang, “Load balancing for 5G ultradense networks using devicetodevice communications,” IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 4039–4050, June 2018.
 [5] I. Siomina and D. Yuan, “Optimization of pilot power for load balancing in WCDMA networks,” in IEEE Global Communications Conference (GLOBECOM), Dallas, Texas, USA, November 2004, pp. 3872–3876.
 [6] A. Wang and V. Krishnamurthy, “Mobility enhanced smart antenna adaptive sectoring for uplink capacity maximization in CDMA cellular network,” IEEE Trans. Commun., vol. 56, no. 5, pp. 743–753, May 2008.
 [7] M. Qin, Q. Yang, N. Cheng, H. Zhou, R. Rao, and X. Shen, “Machine learning aided contextaware selfhealing management for ultra dense networks with QoS provisions,” IEEE Trans. Veh. Technol., October 2018, to appear.
 [8] Z. Du, Y. Sun, W. Guo, Y. Xu, Q. Wu, and J. Zhang, “Datadriven deployment and cooperative selforganization in ultradense small cell networks,” IEEE Access, vol. 6, no. 1, pp. 22 839–22 848, April 2018.
 [9] M. Sheng, C. Yang, Y. Zhang, and J. Li, “Zonebased load balancing in LTE selfoptimizing networks: A gametheoretic approach,” IEEE Trans. Veh. Technol., vol. 63, no. 6, pp. 2916–2925, July 2014.
 [10] R. Kwan, R. Arnott, R. Paterson, and R. Trivisonno, “On mobility load balancing for LTE systems,” in IEEE Vehicular Technology Conference (VTC Fall), Taipei, Taiwan, September 2010, pp. 1–5.
 [11] Y. Yang, P. Li, X. Chen, and W. Wang, “A highefficient algorithm of mobile load balancing in LTE system,” in IEEE Vehicular Technology Conference (VTC Fall), Yokohama, Japan, September 2012, pp. 1–5.
 [12] J. Park, Y. Kim, and J. Lee, “Mobility load balancing method for selforganizing wireless networks inspired by synchronization and matching with preferences,” IEEE Trans. Veh. Technol., vol. 67, no. 3, pp. 2594–2606, March 2018.
 [13] S. Bi, R. Zhang, Z. Ding, and S. Cui, “Wireless communications in the era of big data,” IEEE Trans. Wireless Commun., vol. 53, no. 10, pp. 190–199, October 2015.
 [14] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, “Machine learning for wireless networks with artificial intelligence: A tutorial on neural networks,” arXiv preprint:1710.02913, October 2017. [Online]. Available: https://arxiv.org/abs/1710.02913
 [15] W. Xu, Y. Xu, C. H. Lee, Z. Feng, P. Zhang, and J. Lin, “Datacognitionempowered intelligent wireless networks: Data, utilities, cognition brain, and architecture,” IEEE Wireless Commun., vol. 25, no. 1, pp. 56–63, February 2018.
 [16] P. Muñoz, R. Barco, J. M. RuizAvilés, I. D. L. Bandera, and A. Aguilar, “Fuzzy rulebased reinforcement learning for load balancing techniques in enterprise LTE femtocells,” IEEE Trans. Veh. Technol., vol. 62, no. 5, pp. 1962–1973, June 2013.
 [17] S. S. Mwanje, L. C. Schmelz, and A. MitscheleThiel, “Cognitive cellular networks: A Qlearning framework for selforganizing networks,” IEEE Trans. Netw. Service Manag., vol. 13, no. 1, pp. 85–98, March 2016.
 [18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, d. D. G. Van, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, January 2016.
 [19] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, October 2017.
 [20] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, “Applications of deep learning and reinforcement learning to biological data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2063–2079, June 2018.
 [21] Z. Wang, L. Li, Y. Xu, H. Tian, and S. Cui, “Handover control in wireless systems via asynchronous multiuser deep reinforcement learning,” IEEE Internet Things J., vol. 5, no. 6, pp. 4296–4307, June 2018.
 [22] Y. Xu, W. Xu, Z. Wang, J. Lin, and S. Cui, “Deep reinforcement learning based mobility load balancing under multiple behavior policies,” in IEEE International Conference on Communications (ICC), Shanghai, China, May 2019, to appear.
 [23] 3GPP, “TS 36.331 Evolved universal terrestrial radio access (EUTRAN); Radio resource control (RRC); Protocol specification,” Tech. Rep. Release 8, July 2009.
 [24] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
 [25] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International Conference on Machine Learning (ICML), Beijing, China, June 2014, pp. 387–395.
 [26] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, March 1982.
 [27] M. E. Celebi, H. A. Kingravi, and P. A. Vela, “A comparative study of efficient initialization methods for the kmeans clustering algorithm,” Expert systems with applications, vol. 40, no. 1, pp. 200–210, January 2013.
 [28] U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, pp. 1650–1654, December 2002.
 [29] S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed, “Distributed policy evaluation under multiple behavior strategies,” IEEE Trans. Autom. Control, vol. 60, no. 5, pp. 1260–1274, May 2015.
 [30] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, February 2015.
 [31] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, May 2016, pp. 1–14.
 [32] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning (ICML), New York, USA, June 2016, pp. 1928–1937.
 [33] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. D. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, and S. Petersen, “Massively parallel methods for deep reinforcement learning,” arXiv preprint:1507.04296, July 2015. [Online]. Available: https://arxiv.org/abs/1507.04296
 [34] M. Chu, H. Li, X. Liao, and S. Cui, “Reinforcement learningbased multiaccess control and battery prediction with energy harvesting in IoT systems,” IEEE Internet Things J., vol. 6, no. 2, pp. 2009–2020, April 2019.