Abstract
This work demonstrates the potential of deep reinforcement learningtechniques for transmit power control in wireless networks. Existing techniquestypically find nearoptimal power allocations by solving a challengingoptimization problem. Most of these algorithms are not scalable to largenetworks in realworld scenarios because of their computational complexity andinstantaneous crosscell channel state information (CSI) requirement. In thispaper, a distributively executed dynamic power allocation scheme is developedbased on modelfree deep reinforcement learning. Each transmitter collects CSIand quality of service (QoS) information from several neighbors and adapts itsown transmit power accordingly. The objective is to maximize a weightedsumrate utility function, which can be particularized to achieve maximumsumrate or proportionally fair scheduling. Both random variations and delaysin the CSI are inherently addressed using deep Qlearning. For a typicalnetwork architecture, the proposed algorithm is shown to achieve nearoptimalpower allocation in real time based on delayed CSI measurements available tothe agents. The proposed scheme is especially suitable for practical scenarioswhere the system model is inaccurate and CSI delay is nonnegligible.
Quick Read (beta)
MultiAgent Deep Reinforcement Learning for Dynamic Power Allocation in Wireless Networks
Abstract
This work demonstrates the potential of deep reinforcement learning techniques for transmit power control in wireless networks. Existing techniques typically find nearoptimal power allocations by solving a challenging optimization problem. Most of these algorithms are not scalable to large networks in realworld scenarios because of their computational complexity and instantaneous crosscell channel state information (CSI) requirement. In this paper, a distributively executed dynamic power allocation scheme is developed based on modelfree deep reinforcement learning. Each transmitter collects CSI and quality of service (QoS) information from several neighbors and adapts its own transmit power accordingly. The objective is to maximize a weighted sumrate utility function, which can be particularized to achieve maximum sumrate or proportionally fair scheduling. Both random variations and delays in the CSI are inherently addressed using deep Qlearning. For a typical network architecture, the proposed algorithm is shown to achieve nearoptimal power allocation in real time based on delayed CSI measurements available to the agents. The proposed scheme is especially suitable for practical scenarios where the system model is inaccurate and CSI delay is nonnegligible.
I Introduction
In emerging and future wireless networks, intercell interference management is one of the key technological challenges as access points (APs) become denser to meet everincreasing demand on the capacity. A transmitter may increase its transmit power to improve its own data rate, but at the same time it may degrade links it interferes with. Transmit power control has been implemented since the first generation cellular networks [1]. Our goal here is to maximize an arbitrary weighted sumrate objective, which achieves maximum sumrate or proportionally fair scheduling as special cases.
A number of centralized and distributed optimization techniques have been used to develop algorithms for reaching a suboptimal power allocation [2, 3, 4, 1, 5, 6, 7]. We select two stateoftheart algorithms as benchmarks. These are the weighted minimum mean square error (WMMSE) algorithm [2] and an iterative algorithm based on fractional programming (FP) [3]. In their generic form, both algorithms require full uptodate crosscell channel state information (CSI). To the best of our knowledge, this work is the first to apply deep reinforcement learning to power control [8]. Sun et al. [9] proposed a centralized supervised learning approach to train a fast deep neural network (DNN) that achieves 90% or higher of the sumrate achieved by the WMMSE algorithm. However, this approach still requires acquiring the full CSI. Another issue is that training DNN depends on a massive dataset of the WMMSE algorithm’s output for randomly generated CSI matrices. Such a dataset takes a significant amount of time to produce due to WMMSE’s computational complexity. As the network gets larger, the total number of DNN’s input and output ports also increases, which raises questions on the scalability of the centralized solution of [9]. Furthermore, the success of supervised learning is highly dependent on the accuracy of the system model underlying the computed training data, which requires a new set of training data every time the system model or key parameters change.
In this work, we design a distributively executed algorithm to be employed by all transmitters to compute their best power allocation in real time. Such a dynamic power allocation problem with timevarying channel conditions for a different system model and network setup was studied in [10] and the delay performance of the classical dynamic backpressure algorithm was improved by exploiting the stochastic Lyapunov optimization framework.
The main contributions in this paper and some advantages of the proposed scheme are summarized as follows.

1.
The proposed algorithm is one of the first power allocation schemes to use deep reinforcement learning in the literature. In particular, the distributively executed algorithm is based on deep Qlearning [11], which is modelfree and robust to unpredictable changes in the wireless environment.

2.
The complexity of the distributively executed algorithm does not depend on the network size. In particular, the proposed algorithm is computationally scalable to networks that cover arbitrarily large geographical areas if the number of links per unit area remains upper bounded by the same constant everywhere.

3.
The proposed algorithm learns a policy that guides all links to adjust their power levels under important practical constraints such as delayed information exchange and incomplete crosslink CSI.

4.
Unlike the supervised learning approach [9], there is no need to run an existing nearoptimal algorithm to produce a large amount of training data. We use an applicable centralized network trainer approach that gathers local observations from all network agents. This approach is computationally efficient and robust. In fact, a pretrained neural network can also achieve comparable performance as that of the centralized optimization based algorithms.

5.
We compare the reinforcement learning outcomes with stateoftheart optimizationbased algorithms. We also show the scalability and the robustness of the proposed algorithm using simulations. In the simulation, we model the channel variations inconsequential to the learning algorithm using the Jakes fading model [12]. In certain scenarios the proposed distributed algorithm even outperforms the centralized iterative algorithms introduced in [2, 3]. We also address some important practical constraints that are not included in [2, 3].
Deep reinforcement learning framework has been used in some other wireless communications problems [13, 14, 15, 16]. Classical Qlearning techniques have been applied to the power allocation problem in [17, 18, 19, 20, 21]. The goal in [17, 18] is to reduce the interference in LTEFemtocells. Unlike the deep Qlearning algorithm, the classical algorithm builds a lookup table to represent the value of stateaction pairs, so [17] and [18] represent the wireless environment using a discrete state set and limit the number of learning agents. Amiri et al. [19] have used cooperative Qlearning based power control to increase the QoS of users in femtocells without considering the channel variations. The deep Qlearning based power allocation to maximize the network objective has also been considered in [20, 21]. Similar to the proposed approach, the work in [20, 21] is also based on a distributed framework with a centralized training assumption, but the benchmark to evaluate the performance of their algorithm was a fixed power allocation scheme instead of stateoftheart algorithms. The proposed approach to the state of wireless environment and the reward function is also novel and unique. Specifically, the proposed approach addresses the stochastic nature of wireless environment as well as incomplete/delayed CSI, and arrives at highly competitive strategies quickly.
The remainder of this paper is organized as follows. We give the system model in Section II. In Section III, we formulate the dynamic power allocation problem and give our practical constraints on the local information. In Section IV, we first give an overview of deep Qlearning and then describe the proposed algorithm. We give simulation results in Section V. We conclude with a discussion of possible future work in Section VI.
II System Model
We first consider the classical power allocation problem in a network of $n$ links. We assume that all transmitters and receivers are equipped with a single antenna. The model is often used to describe a mobile ad hoc network (MANET) [5]. The model has also been used to describe a simple cellular network with $n$ APs, where each AP serves a single user device [3, 4]. Let $N=\{1,\mathrm{\dots},n\}$ denote the set of link indexes. We consider a fully synchronized time slotted system with slot duration $T$. For simplicity, we consider a single frequency band with flat fading. We adopt a block fading model to denote the downlink channel gain from transmitter $i$ to receiver $j$ in time slot $t$ as
${g}_{i\to j}^{(t)}$  $={\left{h}_{i\to j}^{(t)}\right}^{2}{\alpha}_{i\to j},t=1,2,\mathrm{\dots}.$  (1) 
Here, ${\alpha}_{i\to j}\ge 0$ represents the largescale fading component including path loss and lognormal shadowing, which remains the same over many time slots. Following Jakes fading model [12], we express the smallscale Rayleigh fading component as a firstorder complex GaussMarkov process:
${h}_{i\to j}^{(t)}$  $=\rho {h}_{i\to j}^{(t1)}+\sqrt{1{\rho}^{2}}{e}_{i\to j}^{(t)}$  (2) 
where ${h}_{i\to j}^{(0)}$ and the channel innovation process ${e}_{i\to j}^{(1)},{e}_{i\to j}^{(2)},\mathrm{\dots}$ are independent and identically distributed circularly symmetric complex Gaussian (CSCG) random variables with unit variance. The correlation $\rho ={J}_{0}(2\pi {f}_{d}T)$, where ${J}_{0}(.)$ is the zerothorder Bessel function of the first kind and ${f}_{d}$ is the maximum Doppler frequency.
The received signaltointerferenceplusnoise ratio (SINR) of link $i$ in time slot $t$ is a function of the allocation $\bm{p}={[{p}_{1},\mathrm{\dots},{p}_{n}]}^{\u22ba}$:
${\gamma}_{i}^{(t)}\left(\bm{p}\right)$  $={\displaystyle \frac{{g}_{i\to i}^{(t)}{p}_{i}}{{\sum}_{j\ne i}{g}_{j\to i}^{(t)}{p}_{j}+{\sigma}^{2}}}$  (3) 
where ${\sigma}^{2}$ is the additive white Gaussian noise (AWGN) power spectral density (PSD). We assume the same noise PSD in all receivers without loss of generality. The downlink spectral efficiency of link $i$ at time $t$ can be expressed as:
$\begin{array}{cc}\hfill {C}_{i}^{(t)}\left(\bm{p}\right)& =\mathrm{log}\left(1+{\gamma}_{i}^{(t)}\left(\bm{p}\right)\right).\hfill \end{array}$  (4) 
The transmit power of transmitter $i$ in time slot $t$ is denoted as ${p}_{i}^{(t)}$. We denote the power allocation of the network in time slot $t$ as ${\bm{p}}^{(t)}={[{p}_{1}^{(t)},\mathrm{\dots},{p}_{n}^{(t)}]}^{\u22ba}$.
III Dynamic Power Control
We are interested in maximizing a generic weighted sumrate objective function. Specifically, the dynamic power allocation problem in slot $t$ is formulated as
$\begin{array}{cc}\hfill \underset{\bm{p}}{maximize}& {\displaystyle \sum _{i=1}^{n}}{w}_{i}^{(t)}\cdot {C}_{i}^{(t)}\left(\bm{p}\right)\hfill \\ \hfill \mathrm{subject}\mathrm{to}& 0\le {p}_{i}\le {P}_{\text{max}},i=1,\mathrm{\dots},n,\hfill \end{array}$  (5) 
where ${w}_{i}^{(t)}$ is the given nonnegative weight of link $i$ in time slot $t$, and ${P}_{\text{max}}$ is the maximum PSD a transmitter can emit. Hence, the dynamic power allocator has to solve an independent problem in the form of (5) at the beginning of every time slot. In time slot $t$, the optimal power allocation solution is denoted as ${\bm{p}}^{(t)}$. Problem (5) is in general nonconvex and has been shown to be NPhard [22].
We consider two special cases. In the first case, the objective is to maximize the sumrate by letting ${w}_{i}^{(t)}=1$ for all $i$ and $t$. In the second case, the weights vary in a controlled manner to ensure proportional fairness [23, 7]. Specifically, at the end of time slot $t$, receiver $i$ computes its weighted average spectral efficiency as
$\begin{array}{cc}\hfill {\overline{C}}_{i}^{(t)}& =\beta \cdot {C}_{i}^{(t)}\left({\bm{p}}^{(t)}\right)+(1\beta ){\overline{C}}_{i}^{(t1)}\hfill \end{array}$  (6) 
where $\beta \in (0,1]$ is used to control the impact of history. User $i$ updates its link weight as:
$\begin{array}{cc}\hfill {w}_{i}^{(t+1)}& ={\left({\overline{C}}_{i}^{(t)}\right)}^{1}.\hfill \end{array}$  (7) 
This power allocation algorithm maximizes the sum of logaverage spectral efficiency [23], i.e.,
$\begin{array}{c}\hfill {\displaystyle \sum _{i\in N}}\mathrm{log}{\overline{C}}_{i}^{(t)}\end{array},$  (8) 
where a user’s longterm average throughput is proportional to its longterm channel quality in some sense.
We use two popular (suboptimal) power allocation algorithms as benchmarks. These are the WMMSE algorithm [2] and the FP algorithm [3]. Both are centralized and iterative in their original form. The closedform FP algorithm used in this paper is formulated in [3, Algorithm 3]. Similarly, a detailed explanation and pseudo code of the WMMSE algorithm is given in [9, Algorithm 1]. The WMMSE and FP algorithms are both centralized and require full crosslink CSI. The centralized mechanism is suitable for a stationary environment with slowly varying weights and no fast fading. For a network with nonstationary environment, it is infeasible to instantaneously collect all CSI over a large network.
It is fair to assume that the feedback delay ${T}_{\text{fb}}$ from a receiver to its corresponding transmitter is much smaller than the slot duration $T$, so the prediction error due to the feedback delay is neglected. Therefore, once receiver $i$ completes a direct channel measurement, we assume that it is also available at the transmitter $i$.
For the centralized approach, once a link acquires the CSI of its direct channel and all other interfering channels to its receiver, passing this information to a central controller is another burden. This is typically resolved using a backhaul network between the APs and the central controller. The CSI of cross links is usually delayed or even outdated. Furthermore, the central controller can only return the optimal power allocation as the iterative algorithm converges, which is another limitation on the scalability.
Our goal is to design a scalable algorithm, so we limit the information exchange to between nearby transmitters. We define two neighborhood sets for every $i\in N$: Let the set of transmitters whose SNR at receiver $i$ was above a certain threshold $\eta $ during the past time slot $t1$ be denoted as
${I}_{i}^{(t)}=\left\{j\in N,j\ne i\right{g}_{j\to i}^{(t1)}{p}_{j}^{(t1)}>\eta {\sigma}^{2}\}.$  (9) 
Let the set of receiver indexes whose SNR from transmitter $i$ was above a threshold in slot $t1$ be denoted as
${O}_{i}^{(t)}=\left\{k\in N,k\ne i\right{g}_{i\to j}^{(t1)}{p}_{i}^{(t1)}>\eta {\sigma}^{2}\}.$  (10) 
From link $i$’s viewpoint, ${I}_{i}^{(t)}$ represents the set of “interferers”, whereas ${O}_{i}^{(t)}$ represents the set of the “interfered” neighbors.
We next discuss the local information a transmitter possesses at the beginning of time slot $t$. First, we assume that transmitter $i$ learns via receiver feedback the direct downlink channel gain, ${g}_{i\to i}^{(t)}$. Further, transmitter $i$ also learns the current total received interferenceplusnoise power at receiver $i$ before the global power update, i.e., ${\sum}_{j\in N,j\ne i}{g}_{j\to i}^{(t)}{p}_{j}^{(t1)}+{\sigma}^{2}$ (as a result of the new gains and the yettobeupdated powers). In addition, by the beginning of slot $t$, receiver $i$ has informed transmitter $i$ of the received power from every interferer $j\in {I}_{i}^{(t)}$, i.e., ${g}_{j\to i}^{(t)}{p}_{j}^{(t1)}$. These measurements can only be available at transmitter $i$ just before the beginning of slot $t$. Hence, in the previous slot $t1$, receiver $i$ also informs transmitter $i$ of the outdated versions of these measurements to be used in the information exchange process performed in slot $t1$ between transmitter $i$ and its interferers.
To clarify, as shown in Fig. 1, transmitter $i$ has sent the following outdated information to interferer $j\in {I}_{i}^{(t)}$ in return for ${w}_{j}^{(t1)}$ and ${C}_{j}^{(t1)}$:

•
the weight of link $i$, ${w}_{i}^{(t1)}$,

•
the spectral efficiency of link $i$ computed from (4), ${C}_{i}^{(t1)}$,

•
the direct gain, ${g}_{i\to i}^{(t1)}$,

•
the received interference power from transmitter $j$, ${g}_{j\to i}^{(t1)}{p}_{j}^{(t1)}$,

•
the total interferenceplusnoise power at receiver $i$, i.e., ${\sum}_{l\in N,l\ne i}{g}_{l\to i}^{(t1)}{p}_{l}^{(t1)}+{\sigma}^{2}$.
As assumed earlier, these measurements are accurate, where the uncertainty about the current CSI is entirely due to the latency of information exchange (one slot). By the same token, from every interfered $k\in {O}_{i}^{(t)}$, transmitter $i$ also obtains $k$’s items listed above.
IV Deep Reinforcement Learning for Dynamic Power Allocation
IVA Overview of Deep QLearning
A reinforcement learning agent learns its best policy from observing the rewards of trialanderror interactions with its environment over time[24, 25]. Let $S$ denote a set of possible states and $A$ denote a discrete set of actions. The state $s\in S$ is a tuple of environment’s features that are relevant to the problem at hand and it describes agent’s relation with its environment [20]. Assuming discrete time steps, the agent observes the state of its environment, ${s}^{(t)}\in S$ at time step $t$. It then takes an action ${a}^{(t)}\in A$ according to a certain policy $\pi $. The policy $\pi (s,a)$ is the probability of taking action $a$ conditioned on the current state being $s$. The policy function must satisfy ${\sum}_{a\in A}\pi (s,a)=1$. Once the agent takes an action ${a}^{(t)}$, its environment moves from the current state ${s}^{(t)}$ to the next state ${s}^{(t+1)}$. As a result of this transition, the agent gets a reward ${r}^{(t+1)}$ that characterizes its benefit from taking action ${a}^{(t)}$ at state ${s}^{(t)}$. This scheme forms an experience at time $t+1$, hereby defined as ${e}^{(t+1)}=({s}^{(t)},{a}^{(t)},{r}^{(t+1)},{s}^{(t+1)})$, which describes an interaction with the environment [11].
The wellknown Qlearning algorithm aims to compute an optimal policy $\pi $ that maximizes a certain expected reward without knowledge of the function form of the reward and the state transitions. Here we let the reward be the future cumulative discounted reward at time $t$:
$\begin{array}{cc}\hfill {R}^{(t)}& ={\displaystyle \sum _{\tau =0}^{\mathrm{\infty}}}{\gamma}^{\tau}{r}^{(t+\tau +1)}\hfill \end{array}$  (11) 
where $\gamma \in (0,1]$ is the discount factor for future rewards. In the stationary setting, we define a Qfunction associated with a certain policy $\pi $ as the expected reward once action $a$ is taken under state $s$[26], i.e.,
${Q}^{\pi}(s,a)$  $={\mathbb{E}}_{\pi}\left[{R}^{(t)}\right{s}^{(t)}=s,{a}^{(t)}=a].$  (12) 
As an action value function, the Qfunction satisfies a Bellman equation [27]:
${Q}^{\pi}(s,a)=\mathcal{R}(s,a)+\gamma {\displaystyle \sum _{{s}^{\prime}\in S}}{\mathcal{P}}_{s{s}^{\prime}}^{a}\left({\displaystyle \sum _{{a}^{\prime}\in A}}\pi ({s}^{\prime},{a}^{\prime}){Q}^{\pi}({s}^{\prime},{a}^{\prime})\right)$  (13) 
where $\mathcal{R}(s,a)=\mathbb{E}\left[{r}^{(t+1)}\right{s}^{(t)}=s,{a}^{(t)}=a]$ is the expected reward of taking action $a$ at state $s$, and ${\mathcal{P}}_{s{s}^{\prime}}^{a}=\mathrm{Pr}\left({s}^{(t+1)}={s}^{\prime}\right{s}^{(t)}=s,{a}^{(t)}=a)$ is the transition probability from given state $s$ to state ${s}^{\prime}$ with action $a$. From the fixedpoint equation (13), the value of $(s,a)$ can be recovered from all values of $({s}^{\prime},{a}^{\prime})\in S\times A$. It has been proved that some iterative approaches such as Qlearning algorithm efficiently converges to the action value function (12) [26]. Clearly, it suffices to let ${\pi}^{*}(s,a)$ be equal to 1 for the most favorable action. From (13), the optimal Qfunction associated with the optimal policy is then expressed as
${Q}^{*}(s,a)=\mathcal{R}(s,a)+\gamma {\displaystyle \sum _{{s}^{\prime}\in S}}{\mathcal{P}}_{s{s}^{\prime}}^{a}\underset{{a}^{\prime}}{\mathrm{max}}{Q}^{*}({s}^{\prime},{a}^{\prime}).$  (14) 
The classical Qlearning algorithm constructs a lookup table, $q(s,a)$, as a surrogate of the optimal Qfunction. Once this lookup table is randomly initialized, the agent takes actions according to the $\u03f5$greedy policy for each time step. The $\u03f5$greedy policy implies that with probability $1\u03f5$ the agent takes the action ${a}^{*}$ that gives the maximum lookup table value for a given current state, whereas it picks a random action with probability $\u03f5$ to avoid getting stuck at nonoptimal policies [11]. After acquiring a new experience as a result of the taken action, the Qlearning algorithm updates a corresponding entry of the lookup table according to:
$\begin{array}{cc}\hfill q({s}^{(t)},{a}^{(t)})& \leftarrow (1\alpha )q({s}^{(t)},{a}^{(t)})\hfill \\ & +\alpha \left({r}^{(t+1)}+\gamma \underset{{a}^{\prime}}{\mathrm{max}}q({s}^{(t+1)},{a}^{\prime})\right)\hfill \end{array}$  (15) 
where $\alpha \in (0,1]$ is the learning rate [26].
In case the state and action spaces are very large, as is the case for the power control problem at hand. The classical Qlearning algorithm fails mainly because of two reasons:

1.
Many states are rarely visited, and
 2.
Both issues can be solved with deep reinforcement learning, e.g., deep Qlearning [11]. A deep neural network called deep Qnetwork (DQN) is used to estimate the Qfunction in lieu of a lookup table. The DQN can be expressed as $q(s,a,\bm{\theta})$, where the realvalued vector $\bm{\theta}$ represents its parameters. The essence of DQN is that the function $q(\cdot ,\cdot ,\bm{\theta})$ is completely determined by $\bm{\theta}$. As such, the task of finding the best Qfunction in a functional space of uncountably many dimensions is reduced to searching the best $\bm{\theta}$ of finite dimensions. Similar to the classical Qlearning, the agent collects experiences with its interaction with the environment. The agent or the network trainer forms a data set $D$ by collecting the experiences until time $t$ in the form of $(s,a,{r}^{\prime},{s}^{\prime})$. As the “quasistatic target network” method [11] implies, we define two DQNs: the target DQN with parameters ${\bm{\theta}}_{\text{target}}^{(t)}$ and the train DQN with parameters ${\bm{\theta}}_{\text{train}}^{(t)}$. ${\bm{\theta}}_{\text{target}}^{(t)}$ is updated to be equal to ${\bm{\theta}}_{\text{train}}^{(t)}$ once every ${T}_{u}$ steps. From the “experience replay” [11], the least squares loss of train DQN for a random minibatch ${D}^{(t)}$ at time $t$ is
$L\left({\bm{\theta}}_{\text{train}}^{(t)}\right)$  $={\displaystyle \sum _{(s,a,{r}^{\prime},{s}^{\prime})\in {D}^{(t)}}}{\left({y}_{DQN}^{(t)}({r}^{\prime},{s}^{\prime})q(s,a;{\bm{\theta}}_{\text{train}}^{(t)})\right)}^{2}$  (16) 
where the target is
${y}_{DQN}^{(t)}({r}^{\prime},{s}^{\prime})={r}^{\prime}+\lambda \underset{{a}^{\prime}}{\mathrm{max}}q({s}^{\prime},{a}^{\prime};{\bm{\theta}}_{\text{target}}^{(t)}).$  (17) 
Finally, we assume that each time step the stochastic gradient descent algorithm that minimizes the loss function (16) is used to train the minibatch ${D}^{(t)}$. The stochastic gradient descent returns the new parameters of train DQN using the gradient computed from just few samples of the dataset and has been shown to converge to a set of good parameters quickly [29].
IVB Proposed MultiAgent Deep Reinforcement Learning Algorithm
As depicted in Fig. 2, we propose a multiagent deep reinforcement learning scheme with each transmitter as an agent. Similar to [30], we define the local state of learning agent $i$ as ${s}_{i}\in {S}_{i}$ which is composed of environment features that are relevant to agent $i$’s action ${a}_{i}\in {A}_{i}$. In the multiagent learning system, the state transitions of their common environment depend on the agents’ joint actions. An agent’s environment transition probabilities in (13) may not be stationary as other learning agents update their policies. The Markov property introduced for the singleagent case in Section IVA no longer holds in general [31]. This “environment nonstationarity” issue may cause instability during the learning process. One way to tackle the issue is to train a single meta agent with a DQN that outputs joint actions for the agents [32]. The complexity of the stateaction space, and consequently the DQN complexity, will then be proportional to the total number of agents in the system. The singlemeta agent approach is not suitable for our dynamic setup and the distributed execution framework, since its DQN can only forward the action assignments to the transmitters after acquiring the global state information. There is an extensive research to develop multiagent learning frameworks and there exists several multiagent Qlearning adaptations [33, 31]. However, multiagent learning is an open research area and theoretical guarantees for these adaptations are rare and incomplete despite their good empirical performances [33, 31].
In this work, we take an alternative approach where the DQNs are distributively executed at the transmitters, whereas training is centralized to ease implementation and to improve stability. Each agent $i$ has the same copy of the DQN with parameters ${Q}_{\text{target}}^{(t)}$ at time slot $t$. The centralized network trainer trains a single DQN by using the experiences gathered from all agents. This significantly reduces the amount of memory and computational resources required by training. The centralized training framework is also similar to the parameter sharing concept which allows the learning algorithm to draw advantage from the fact that agents are learning together for faster convergence [34]. Since agents are working collaboratively to maximize the global objective in (5) with an appropriate reward function design to be discussed in Section IVE, each agent can benefit from experiences of others. Note that sharing the same DQN parameters still allows different behavior among agents, because they execute the same DQN with different local states as input.
As illustrated in Fig. 2, at the beginning of time slot $t$, agent $i$ takes action ${a}_{i}^{(t)}$ as a function of ${s}_{i}^{(t)}$ based on the current decision policy. All agents are synchronized and take their actions at the same time. Prior to taking action, agent $i$ has observed the effect of the past actions of its neighbors on its current state, but it has no knowledge of ${a}_{j}^{(t)}$, $\forall j\ne i$. From the past experiences, agent $i$ is able to acquire an estimation of what is the impact of its own actions on future actions of its neighbors, and it can determine a policy that maximizes its discounted expected future reward with the help of deep Qlearning.
The proposed DQN is a fullyconnected deep neural network [35, Chapter 5] that consists of five layers as shown in Fig. (a)a. The first layer is fed by the input state vector of length ${N}_{0}$. We relegate the detailed design of the state vector elements to Section IVC. The input layer is followed by three hidden layers with ${N}_{1}$, ${N}_{2}$, and ${N}_{3}$ neurons, respectively. At the output layer, each port gives an estimate of the Qfunction with given state input and the corresponding action output. The total number of DQN output ports is denoted as ${N}_{4}$ which is equal to the cardinality of the action set to be described in Section IVD. The agent finds the action that has the maximum value at the DQN output and takes this action as its transmit power.
In Fig. (a)a, we also depicted the connection between these layers by using the weights and biases of the DQN which form the set of parameters. The total number of scalar parameters in the fully connected DQN is
$\theta ={\displaystyle \sum _{l=0}^{3}}\left({N}_{l}+1\right){N}_{l+1}.$  (18) 
In addition, Fig. (b)b describes the functionality of a single neuron which applies a nonlinear activation function to its combinatorial input.
During the training stage, in each time slot, the trainer randomly selects a minibatch ${D}^{(t)}$ of ${M}_{b}$ experiences from an experiencereplay memory [11] that stores the experiences of all agents. The experiencereplay memory is a FIFO queue [15] with a length of $n{M}_{m}$ samples where $n$ is the total number of agents, i.e., a new experience replaces the oldest experience in the queue and the queue length is proportional to the number of agents. At time slot $t$ the most recent experience from agent $i$ is ${e}_{i}^{(t1)}=({s}_{i}^{(t2)},{a}_{i}^{(t2)},{r}_{i}^{(t1)},{s}_{i}^{(t1)})$ due to delay. Once the trainer picks ${D}^{(t)}$, it updates the parameters to minimize the loss in (16) using an appropriate optimizer, e.g., the stochastic gradient descent method [29]. As also explained in Fig. 2, once per ${T}_{u}$ time slots, the trainer broadcasts the latest trained parameters. The new parameters are available at the agents after ${T}_{d}$ time slots due to the transmission delay through the backhaul network. Training may be terminated once the parameters converge.
IVC States
As described in Section III, agent $i$ builds its state ${s}_{i}^{(t)}$ using information from the interferer and interfered sets given by (9) and (10), respectively. To better control the complexity, we set $\left{\overline{I}}_{i}^{(t)}\right=\left{\overline{O}}_{i}^{(t)}\right=c$, where $c>0$ is the restriction on the number of interferers and interfereds the AP communicating with. At the beginning of time slot $t$, agent $i$ sorts its interferers by current received power from interferer $j\in {I}_{i}^{(t)}$ at receiver $i$, i.e., ${g}_{j\to i}^{(t)}{p}_{j}^{(t1)}$. This sorting process allows agent $i$ to prioritize its interferers. As $\left{I}_{i}^{(t)}\right>c$, we want to keep strong interferers which have higher impact on agent $i$’s next action. On the other hand, if $$, agent $i$ adds $\left{I}_{i}^{(t)}\rightc$ virtual noise agents to ${I}_{i}^{(t)}$ to fit the fixed DQN. A virtual noise agent is assigned an arbitrary negative weight and spectral efficiency. Its downlink and interfering channel gains are taken as zero in order to avoid any impact on agent $i$’s decisionmaking. The purpose of having these virtual agents as placeholders is to provide inconsequential inputs to fill the input elements of fixed length, like ‘padding zeros’. After adding virtual noise agents (if needed), agent $i$ takes first $c$ interferers to form ${\overline{I}}_{i}^{(t)}$. For the interfered neighbors, agent $i$ follows a similar procedure, but this time the sorting criterion is the share of agent $i$ on the interference at receiver $k\in {O}_{i}^{(t)}$, i.e., ${g}_{i\to k}^{(t1)}{p}_{i}^{(t1)}{\left({\sum}_{j\in N,j\ne k}{g}_{j\to k}^{(t1)}{p}_{j}^{(t1)}+{\sigma}^{2}\right)}^{1}$, in order to give priority to the most significantly affected interfered neighbors by agent $i$’s interference.
The way we organize the local information to build ${s}_{i}^{(t)}$ accommodates some intuitive and systematic basics. Based on these basics, we perfected our design by trialanderror with some preliminary simulations. We now describe the state of agent $i$ at time slot $t$, i.e., ${s}_{i}^{(t)}$, by dividing it into three main feature groups as:
IVC1 Local Information
The first element of this feature group is agent $i$’s transmit power during previous time slot, i.e., ${p}_{i}^{(t1)}$. Then, this is followed by the second and third elements that specify agent $i$’s most recent potential contribution on the network objective (5): $1/{w}_{i}^{(t)}$ and ${C}_{i}^{(t1)}$. For the second element, we do not directly use ${w}_{i}^{(t)}$ which tends to be quite large as ${\overline{C}}_{i}^{(t)}$ is close to zero from (7). We found that using $1/{w}_{i}^{(t)}$ is more desirable. Finally, the last four elements of this feature group are the last two measurements of its direct downlink channel and the total interferenceplusnoise power at receiver $i$: ${g}_{i\to i}^{(t)}$, ${g}_{i\to i}^{(t1)}$, ${\sum}_{j\in N,j\ne i}{g}_{j\to i}^{(t)}{p}_{j}^{(t1)}+{\sigma}^{2}$, and ${\sum}_{j\in N,j\ne i}{g}_{j\to i}^{(t1)}{p}_{j}^{(t2)}+{\sigma}^{2}$. Hence, a total of seven input ports of the input layer are reserved for this feature group. In our state set design, we take the last two measurements into account to give the agent a better chance to track its environment change. Intuitively, the lower the maximum Doppler frequency, the slower the environment changes, so that having more past measurements will help the agent to make better decisions [15]. On the other hand, this will result with having more state information which may increase the complexity and decrease the learning efficiency. Based on preliminary simulations, we include two past measurements.
IVC2 Interfering Neighbors
This feature group lets agent $i$ observe the interference from its neighbors to receiver $i$ and what is the contribution of these interferers on the objective (5). For each interferer $j\in {\overline{I}}_{i}^{(t)}$, three input ports are reserved for ${g}_{j\to i}^{(t)}{p}_{j}^{(t1)}$, $1/{w}_{j}^{(t1)}$, ${C}_{j}^{(t1)}$. The first term indicates the interference that agent $i$ faced from its interferer $j$; the other two terms imply the significance of agent $j$ in the objective (5). Similar to the local information feature explained in the previous paragraph, agent $i$ also considers the history of its interferers in order to track changes in its own receiver’s interference condition. For each interferer ${j}^{\prime}\in {\overline{I}}_{i}^{(t1)}$, three input ports are reserved for ${g}_{{j}^{\prime}\to i}^{(t1)}{p}_{{j}^{\prime}}^{(t2)}$, $1/{w}_{{j}^{\prime}}^{(t2)}$, ${C}_{{j}^{\prime}}^{(t2)}$. A total of $6c$ state elements are reserved for this feature group.
IVC3 Interfered Neighbors
Finally, agent $i$ uses the feedback from its interfered neighbors to gauge its interference to nearby receivers and the contribution of them on the objective (5). If agent $i$’s link was inactive during the previous time slot, then ${O}_{i}^{(t1)}=\mathrm{\varnothing}$. For this case, if we ignore the history and directly consider the current interfered neighbor set, the corresponding state elements will be useless. Note that agent $i$’s link became inactive when its own estimated contribution on the objective (5) was not significant enough compared to its interference to its interfered neighbors. Thus, after agent $i$’s link became inactive, in order to decide when to reactivate its link, it should keep track of the interfered neighbors that implicitly silenced itself. We solve this issue by defining time slot ${t}_{i}^{\prime}$ which is the last time slot agent $i$ was active. The agent $i$ carries the feedback from interfered $k\in {\overline{O}}_{i}^{({t}_{i}^{\prime})}$. We also pay attention to the fact that if $$, interfered $k$ has no knowledge of ${g}_{i\to k}^{(t1)}$, but it is still able to send its local information to agent $i$. Therefore, agent $i$ reserves four elements of its state set for each interfered $k\in {O}_{i}^{({t}_{i}^{\prime})}$ as ${g}_{k\to k}^{(t1)}$, $1/{w}_{k}^{(t1)}$, ${C}_{k}^{(t1)}$, and ${g}_{i\to k}^{({t}_{i}^{\prime})}{p}_{i}^{({t}_{i}^{\prime})}{\left({\sum}_{j\in N,j\ne k}{g}_{j\to k}^{(t1)}{p}_{j}^{(t1)}+{\sigma}^{2}\right)}^{1}$. This makes a total of $4c$ elements of the state set reserved for the interfered neighbors.
IVD Actions
Unlike taking discrete steps on the previous transmit power level (see, e.g., [20]), we use discrete power levels taken between $0$ and ${P}_{\text{max}}$. All agents have the same action space, i.e., ${A}_{i}={A}_{j}=A$, $\forall i,j\in N$. Suppose we have $A>1$ discrete power levels. Then, the action set is given by
$\begin{array}{cc}\hfill A& =\{0,{\displaystyle \frac{{P}_{\text{max}}}{A1}},{\displaystyle \frac{2{P}_{\text{max}}}{A1}},\mathrm{\dots},{P}_{\text{max}}\}\hfill \end{array}.$  (19) 
The total number of DQN output ports denoted as ${N}_{4}$ in Fig. (a)a is equal to $A$. Agent $i$ is only allowed to pick an action ${a}_{i}(t)\in A$ to update its power strategy at time slot $t$. This way of approaching the problem could increase the number of DQN output ports compared to [20], but it will increase the robustness of the learning algorithm. For example, as the maximum Doppler frequency ${f}_{d}$ or time slot duration $T$ increases, the correlation term $\rho $ in (2) is going to decrease and the channel state will vary more. This situation may require the agents to react faster, i.e., possible transition from zeropower to fullpower, which can be addressed efficiently with an action set composed of discrete power levels.
IVE Reward Function
The reward function is designed to optimize the network objective (5). We interpret the reward as how the action of agent $i$ through time slot $t$, i.e., ${p}_{i}^{(t)}$, affects the weighted sumrate of its own and its future interfered neighbors ${O}_{i}^{(t+1)}$. During the time slot $t+1$, for all agent $i\in N$, the network trainer calculates the spectral efficiency of each link $k\in {O}_{i}^{(t+1)}$ without the interference from transmitter $i$ as
$\begin{array}{cc}\hfill {C}_{k\setminus i}^{(t)}& =\mathrm{log}\left(1+{\displaystyle \frac{{g}_{k\to k}^{(t)}{p}_{k}^{(t)}}{{\sum}_{j\ne i,k}{g}_{j\to k}^{(t)}{p}_{j}^{(t)}+{\sigma}^{2}}}\right).\hfill \end{array}$  (20) 
The network trainer computes the term ${\sum}_{j\ne i,k}{g}_{j\to k}^{(t)}{p}_{j}^{(t)}+{\sigma}^{2}$ in (20) by simply subtracting ${g}_{i\to k}^{(t)}{p}_{i}^{(t)}$ from the total interferenceplusnoise power at receiver $k$ in time slot $t$. As assumed in Section III, since transmitter $i\in {I}_{k}^{(t+1)}$, its interference to link $k$ in slot $t$, i.e., ${g}_{i\to k}^{(t)}{p}_{i}^{(t)}>\eta {\sigma}^{2}$, is accurately measurable by receiver $k$ and has been delivered to the network trainer.
In time slot $t$, we account for the externality that link $i$ causes to link $k$ using a price charged to link $i$ for generating interference to link $k$ [5]:
$\begin{array}{cc}\hfill {\pi}_{i\to k}^{(t)}& ={w}_{k}^{(t)}\left({C}_{k\setminus i}^{(t)}{C}_{k}^{(t)}\right).\hfill \end{array}$  (21) 
Then, the reward function of agent $i\in N$ at time slot $t+1$ is defined as
$\begin{array}{cc}\hfill {r}_{i}^{(t+1)}& ={w}_{i}^{(t)}{C}_{i}^{(t)}{\displaystyle \sum _{k\in {O}_{k}^{(t+1)}}}{\pi}_{i\to k}^{(t)}.\hfill \end{array}$  (22) 
The reward of agent $i$ consists of two main components: its direct contribution to the network objective (5) and the penalty due to its interference to all interfered neighbors. Evidently, transmitting at peak power ${p}_{i}^{(t)}={P}_{\text{max}}$ maximizes the direct contribution as well as the penalty, whereas being silent earns zero reward.
V Simulation Results
VA Simulation Setup
To begin with, we consider $n$ links on $n$ homogeneously deployed cells, where we choose $n$ to be between 19 and 100. Transmitter $i$ is located at the center of cell $i$ and receiver $i$ is located randomly within the cell. We also discuss the extendability of our algorithm to multilink per cell scenarios in Section VB. The half transmittertotransmitter distance is denoted as $R$ and it is between 100 and 1000 meters. We also define an inner region of radius $r$ where no receiver is allowed to be placed. We set the $r$ to be between $10$ and $R1$ meters. Receiver $i$ is placed randomly according to a uniform distribution on the area between out of the inner region of radius $r$ and the cell boundary. Fig. 4 shows two network configuration examples.
We set ${P}_{\text{max}}$, i.e., the maximum transmit power level of transmitter $i$, to 38 dBm over 10 MHz frequency band which is fully reusable across all links. The distance dependent path loss between all transmitters and receivers is simulated by $120.9+37.6{\mathrm{log}}_{10}(d)$ (in dB), where $d$ is transmittertoreceiver distance in km. This path loss model is compliant with the LTE standard [36]. The lognormal shadowing standard deviation is taken as 8 dB. The AWGN power ${\sigma}^{2}$ is 114 dBm. We set the threshold $\eta $ in (9) and (10) to 5. We assume fullbuffer traffic model. Similar to [37], if the received SINR is greater than 30 dB, it is capped at 30 dB in the calculation of spectral efficiency by (4). This is to account for typical limitations of finiteprecision digital processing. In addition to these parameters, we take the period of the timeslotted system $T$ to be 20 ms. Unless otherwise stated, the maximum Doppler frequency ${f}_{d}$ is 10 Hz and identical for all receivers.
We next describe the hyperparameters used for the architecture of our algorithm. Since our goal is to ensure that the agents make their decisions as quickly as possible, we do not overparameterize the network architecture and we use a relatively small network for training purposes. Our algorithm trains a DQN with one input layer, three hidden layers, and one output layer. The hidden layers have ${N}_{1}=200$, ${N}_{2}=100$, and ${N}_{3}=40$ neurons, respectively. We have $7$ DQN input ports reserved for the local information feature group explained in Section IVC. The cardinality constraint on the neighbor sets $c$ is 5 agents. Hence, again from Section IVC, the input ports reserved for the interferer and the interfered neighbors are $6c=30$ and $4c=20$, respectively. This makes a total of ${N}_{0}=57$ input ports reserved for the state set. (We also normalize the inputs with some constants depending on ${P}_{\text{max}}$, maximum intracell path loss, etc., to optimize the performance.) We use ten discrete power levels, ${N}_{4}=A=10$. Thus, the DQN has ten outputs. Initial parameters of the DQN are generated with the truncated normal distribution function of the TensorFlow [38]. For our application, we observed that the rectifier linear unit (ReLU) function converges to a desirable power allocation slightly slower than the hyperbolic tangent (tanh) function, so we used tanh as DQN’s activation function. Memory parameters at the network trainer, ${M}_{b}$ and ${M}_{m}$, are 256 and 1000 samples, respectively. We use the RMSProp algorithm [39] with an adaptive learning rate ${\alpha}^{(t)}$. For a more stable deep Qlearning outcome, the learning rate is reduced as ${\alpha}^{(t+1)}=\lambda {\alpha}^{(t)}$, where $\lambda \in (0,1)$ is the decay rate of ${\alpha}^{(t)}$ [40]. Here, ${\alpha}^{(0)}$ is $5\times {10}^{3}$ and $\lambda $ is ${10}^{4}$. We also apply adaptive $\u03f5$greedy algorithm: ${\u03f5}^{(0)}$ is initialized to 0.2 and it follows ${\u03f5}^{(t+1)}=\mathrm{max}\{{\u03f5}_{\text{min}},{\lambda}_{\u03f5}{\u03f5}^{(t)}\}$, where ${\u03f5}_{\text{min}}={10}^{2}$ and ${\lambda}_{\u03f5}={10}^{4}$.
Although the discount factor $\gamma $ is nearly arbitrarily chosen to be close to 1 and increasing $\gamma $ potentially improves the outcomes of deep Qlearning for most of its applications [40], we set $\gamma $ to 0.5. The reason we use a moderate level of $\gamma $ is that the correlation between agent’s actions and its future rewards tends to be smaller for our application due to fading. An agent’s action has impact on its own future reward through its impact on the interference condition of its neighbors and consequences of their unpredictable actions. Thus, we set $\gamma \ge 0.5$. We observed that higher $\gamma $ is not desirable either. It slows the DQN’s reaction to channel changes, i.e., high ${f}_{d}$ case. For high $\gamma $, the DQN converges to a strategy that makes the links with better steadystate channel condition greedy. As ${f}_{d}$ becomes large, due to fading, the links with poor steadystate channel condition may become more advantageous for some timeslots. Having a moderate level of $\gamma $ helps detect these cases and allows poor links to be activated during these time slots when they can contribute the network objective (5). Further, the training cycle duration ${T}_{u}$ is 100 time slots. After we set the parameters in (18), we can compute the total number of DQN parameters, i.e., $\theta $, as 36,150 parameters. After each ${T}_{u}$ time slots, trained parameters at the central controller will be delivered to all agents in ${T}_{d}$ time slots via backhaul network as explained in Section IVB. We assume that the parameters are transferred without any compression and the backhaul network uses pure peertopeer architecture. As ${T}_{d}=50$ time slots, i.e., 1 second, the minimum required downlink/uplink capacity for all backhaul links is about 1 Mbps. Once the training stage is completed, the backhaul links will be used only for limited information exchange between neighbors which requires negligible backhaul link capacity.
We empirically validate the functionality of our algorithm. We implemented the proposed algorithm with TensorFlow [38]. Each result is an average of at least 10 randomly initialized simulations. We have two main phases for the simulations: training and testing. Each training lasts 40,000 time slots or $40,000\times 20$ ms = 800 seconds, and each testing lasts 5,000 time slots or 100 seconds. During the testing, the trainer leaves the network and the $\u03f5$greedy algorithm is terminated, i.e., agents stop exploring the environment.
We have five benchmarks to evaluate the performance of our algorithm. The first two benchmarks are ‘ideal WMMSE’ and ‘ideal FP’ with instantaneous full CSI and centralized algorithm outcome. The third benchmark is the ‘central power allocation’ (central), where we introduce one time slot delay on the full CSI and feed it to the FP algorithm. Even the single time slot delay to acquire the full CSI is a generous assumption, but it is a useful approach to reflect potential performance of negligible computation time achieved with the supervised learning approach introduced in [9]. The next benchmark is the ‘random’ allocation, where each agent chooses its transmit power for each slot at random uniformly between 0 and ${P}_{\text{max}}$. The last benchmark is the ‘fullpower’ allocation, i.e., each agent’s transmit power is ${P}_{\text{max}}$ for all slots.
VB SumRate Maximization
In this subsection, we focus on the sumrate by setting the weights of all network agents to 1 through all time slots.
VB1 Robustness
We fix $n=19$ links and use two approaches to evaluate performance. The first approach is the ‘matched’ DQN where we use the first 40,000 time slots to train a DQN from scratch, whereas for the ‘unmatched’ DQN we ignore the matched DQN specialized for a given specific initialization, and for the testing (the last 5,000 time slots) we randomly pick another DQN trained for another initialization with the same $R$ and $r$ parameters. In other words, for the unmatched DQN case, we skip the training stage and use the matched DQN that was trained for a different network initialization scenario and was stored in the memory. Here an unmatched DQN is always trained for a random initialization with $n$ = 19 links and ${f}_{d}$ = 10 Hz.
In Table I, we vary $R$ and see that training a DQN from scratch for the specific initialization is able to outperform both stateoftheart centralized algorithms that are under ideal conditions such as full CSI and no delay. Interestingly, the unmatched DQN approach converges to the central power allocation where we feed the FP algorithm with delayed full CSI. The DQN approach achieves this performance with distributed execution and incomplete CSI. In addition, training a DQN from scratch enables our algorithm to learn to compensate for CSI delays and specialize for its network initialization scenario. Training a DQN from scratch swiftly converges in about 25,000 time slots (shown in Fig. (a)a).
Additional simulations with $r$ and ${f}_{d}$ taken as variables are summarized in Table II and Table III, respectively. As the area of receiverfree inner region increases, the receivers get closer to the interfering transmitters and the interference mitigation becomes more necessary. Hence, the random and fullpower allocations tend to show much lower sumrate performance compared to the central algorithms. For that case, our algorithm still shows decent performance and the convergence rate is still about 25,000 time slots. We also stress the DQN under various ${f}_{d}$ scenarios. As we reduce ${f}_{d}$, its sumrate performance remains unchanged, but the convergence time drops to 15,000 time slots. As ${f}_{d}\to \mathrm{\infty}$, i.e., we set $\rho =0$ to remove the temporal correlation between current channel condition and past channel conditions, the convergence takes more than 35,000 time slots. Intuitively, the reason of this effect on the convergence rate is that the variation of states visited during the training phase is proportional to ${f}_{d}$. Further, the comparable performance of the unmatched DQN with the central power allocation shows the robustness of our algorithm to the changes in interference conditions and fading characteristics of the environment.
VB2 Scalability
In this subsection, we increase the total number of links to investigate the scalability of our algorithm. As we increase $n$ to 50 links, the DQN still converges in 25,000 time slots with high sumrate performance. As we keep on increasing $n$ to 100 links, from Table IV, the matched DQN’s sumrate outperformance drops because of the fixed input architecture of the DQN.
Note that each agent only considers $c=5$ interferer and interfered neighbors. The performance of DQN can be improved for that case by increasing $c$ at a higher computational complexity. Additionally, the unmatched DQN trained for just 19 links still shows good performance as we increase the number of links.
It is worth pointing out that each agent is able to determine its own action in less than $0.5$ ms on a personal computer. Therefore, our algorithm is suitable for dynamic power allocation. In addition, running a single batch takes less than $T$ = 20 ms. Most importantly, because of the fixed architecture of the DQN, increasing the total number of links from 19 to 100 has no impact on these values. It will just increase the queue memory in the network trainer. For the FP algorithm it takes about 15 ms to converge for $n$ = 19 links, but with $n$ = 100 links it becomes 35 ms. The WMMSE algorithm converges slightly slower, and the convergence time is still proportional to $n$ which limits its scalability.
VB3 Extendability to MultiLink per Cell Scenarios and Different Channel Models
In this subsection, we first consider a special homogeneous cell deployment case with colocated transmitters at the cell centers. We also assume that the colocated transmitters within a cell do not perform successive interference cancellation [9]. The WMMSE and FP algorithms can be applied to this multilink per cell scenario without any modifications.
We fix $R$ and $r$ to 500 and 10 meters, respectively. We set ${f}_{d}$ to 10 Hz and the total number of cells to $19$. We first consider two scenarios where each cell has 2 and 4 links, respectively. The third scenario assigns each cell a random number of links from 1 to 4 links per cell as shown in Fig. (b)b. The testing stage results for these multilink per cell scenarios are given in Table V. As shown in Table VI, we further test these scenarios using a different channel model called urban microcell (UMi) street canyon model of [41]. For this model, we take the carrier frequency as 1 GHz. The transmitter and receiver antenna heights are assumed to be 10 and 1.5 meters, respectively.
Our simulations for these scenarios show that as we increase number of links per cell, the training stage still converges in about 25,000 time slots. Fig. (a)a shows the convergence rate of training stage for 4 links per cell scenario with 76 links. In Fig. (a)a, we also show that using a different channel model, i.e., UMi street canyon, does not affect the convergence rate. Although the convergence rate is unaffected, the proposed algorithm’s average sumrate performance decreases as we increase number of links per cell. Our algorithm still outperforms the centralized algorithms even for 4 links per cell scenario for both channel models. Another interesting fact is that although the unmatched DQN was trained for a singlelink deployment scenario and can not handle the delayed CSI constraint as good as the matched DQN, it gives comparable performance with the ‘central’ case. Thus, the unmatched DQN is capable of finding good estimates of optimal actions for unseen local state inputs.
VC Proportionally Fair Scheduling
In this subsection, we change the link weights according to (7) to ensure fairness as described in Section III. We choose the $\beta $ term in (6) to be 0.01 and use convergence to the objective in (8) as performancemetric of the DQN. We also make some additions to the training and testing stage of DQN. We need an initialization for the link weights. This is done by letting all transmitters to serve their receivers with fullpower at $t$ = 0, and initialize weights according to the initial spectral efficiencies computed from (4). For the testing stage, we reinitialize the weights after the first 40,000 slots to see whether the trained DQN can achieve fairness as fast as the centralized algorithms.
As shown in Fig. 7, the training stage converges to a desirable scheduling in about 30,000 time slots. Once the network is trained, as we reinitialize the link weights, our algorithm converges to an optimal scheduling in a distributed fashion as fast as the centralized algorithms. Next, we set $R$ and $r$ as variables to get results in Table VII and Table VIII. We see that the trained DQN from scratch still outperforms the centralized algorithms in most of the initializations, using the unmatched DQN also achieves a high performance similar to the previous sections.
VI Conclusion and Future Work
In this paper, we have proposed a distributively executed modelfree power allocation algorithm which outperforms or achieves comparable performance with existing stateoftheart centralized algorithms. We see potentials in applying the reinforcement learning techniques on various dynamic wireless network resource management tasks in place of the optimization techniques. The proposed approach returns the new suboptimal power allocation much quicker than two of the popular centralized algorithms taken as the benchmarks in this paper. In addition, by using the limited local CSI and some realistic practical constraints, our deep Qlearning approach usually outperforms the generic WMMSE and FP algorithms which requires the full CSI which is an inapplicable condition. Differently from most advanced optimization based power control algorithms, e.g. WMMSE and FP, that require both instant and accurate measurements of individual channel gains, our algorithm only requires accurate measurements of some delayed received power values that are higher than a certain threshold above noise level. An extension to an imperfect CSI case with inaccurate CSI measurements is left for future work.
Meng et al. [42] is an extension of our preprint version [8] to multiple users in a cell, which is also addressed in the current paper. Although the centralized training phase seems to be a limitation on the proposed algorithm in terms of scalability, we have shown that a DQN trained for a smaller wireless network can be applied to a larger wireless network and a jumpstart on the training of DQN can also be implemented by using initial parameters taken from another DQN previously trained for a different setup.
Finally, we used global training in this paper, whereas reinitializing a local training over the regions where new links joined or performance dropped under a certain threshold is also an interesting direction to consider. Besides the regional training, completely distributed training can be considered, too. While a centralized training approach saves computational resources and converges faster, distributed training may beat a path for an extension of the proposed algorithm to some other channel deployment scenarios that involves mobile users. The main hurdle on the way to apply distributed training is to avoid the instability caused by the environment nonstationarity.
VII Acknowledgement
We thank Dr. Mingyi Hong, Dr. Wei Yu, Dr. Georgios Giannakis, and Dr. Gang Qian for stimulating discussions.
References
 [1] M. Chiang, P. Hande, T. Lan, and C. W. Tan, “Power control in wireless cellular networks,” Foundations and Trends in Networking, vol. 2, no. 4, pp. 381–533, 2007.
 [2] Q. Shi, M. Razaviyayn, Z. Q. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sumutility maximization for a MIMO interfering broadcast channel,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4331–4340, Sept 2011.
 [3] K. Shen and W. Yu, “Fractional programming for communication systems—part I: Power control and beamforming,” IEEE Transactions on Signal Processing, vol. 66, no. 10, pp. 2616–2630, May 2018.
 [4] I. Sohn, “Distributed downlink power control by messagepassing for very largescale networks,” International Journal of Distributed Sensor Networks, vol. 11, no. 8, p. e902838, 2015.
 [5] J. Huang, R. A. Berry, and M. L. Honig, “Distributed interference compensation for wireless networks,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 5, pp. 1074–1084, May 2006.
 [6] S. G. Kiani, G. E. Oien, and D. Gesbert, “Maximizing multicell capacity using distributed power allocation and scheduling,” in 2007 IEEE Wireless Communications and Networking Conference, March 2007, pp. 1690–1694.
 [7] H. Zhang, L. Venturino, N. Prasad, P. Li, S. Rangarajan, and X. Wang, “Weighted sumrate maximization in multicell networks via coordinated scheduling and discrete power control,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 6, pp. 1214–1224, June 2011.
 [8] Y. S. Nasir and D. Guo, “Deep reinforcement learning for distributed dynamic power allocation in wireless networks,” arXiv eprints, p. arXiv:1808.00490v1, Aug. 2018.
 [9] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438–5453, Oct 2018.
 [10] M. J. Neely, E. Modiano, and C. E. Rohrs, “Dynamic power allocation and routing for timevarying wireless networks,” IEEE Journal on Selected Areas in Communications, vol. 23, no. 1, pp. 89–103, Jan 2005.
 [11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [12] L. Liang, J. Kim, S. C. Jha, K. Sivanesan, and G. Y. Li, “Spectrum and power allocation for vehicular communications with delayed csi feedback,” IEEE Wireless Communications Letters, vol. 6, no. 4, pp. 458–461, Aug 2017.
 [13] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” CoRR, vol. abs/1810.07862, 2018. [Online]. Available: http://arxiv.org/abs/1810.07862
 [14] H. Ye and G. Y. Li, “Deep reinforcement learning for resource allocation in V2V communications,” in 2018 IEEE International Conference on Communications (ICC), May 2018, pp. 1–6.
 [15] Y. Yu, T. Wang, and S. C. Liew, “Deepreinforcement learning multiple access for heterogeneous wireless networks,” in 2018 IEEE International Conference on Communications (ICC), May 2018, pp. 1–7.
 [16] R. Li, Z. Zhao, Q. Sun, C.L. I, C. Yang, X. Chen, M. Zhao, and H. Zhang, “Deep reinforcement learning for resource management in network slicing,” arXiv eprints, p. arXiv:1805.06591, May 2018.
 [17] M. Bennis and D. Niyato, “A Qlearning based approach to interference avoidance in selforganized femtocell networks,” in 2010 IEEE Globecom Workshops, Dec 2010, pp. 706–710.
 [18] M. Simsek, A. Czylwik, A. GalindoSerrano, and L. Giupponi, “Improved decentralized Qlearning algorithm for interference reduction in LTEfemtocells,” in 2011 Wireless Advanced, June 2011, pp. 138–143.
 [19] R. Amiri, H. Mehrpouyan, L. Fridman, R. K. Mallik, A. Nallanathan, and D. Matolak, “A machine learning approach for power allocation in hetnets considering qos,” in 2018 IEEE International Conference on Communications (ICC), May 2018, pp. 1–7.
 [20] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, “A reinforcement learning approach to power control and rate adaptation in cellular networks,” in 2017 IEEE International Conference on Communications (ICC), May 2017, pp. 1–7.
 [21] F. D. Calabrese, L. Wang, E. Ghadimi, G. Peters, L. Hanzo, and P. Soldati, “Learning radio resource management in rans: Framework, opportunities, and challenges,” IEEE Communications Magazine, vol. 56, no. 9, pp. 138–145, Sep. 2018.
 [22] Z. Q. Luo and S. Zhang, “Dynamic spectrum management: Complexity and duality,” IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 1, pp. 57–73, Feb 2008.
 [23] D. N. C. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge University Press, 2005.
 [24] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” Journal of artificial intelligence research, vol. 4, pp. 237–285, 1996.
 [25] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
 [26] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári, “Convergence results for singlestep onpolicy reinforcementlearning algorithms,” Machine learning, vol. 38, no. 3, pp. 287–308, 2000.
 [27] A. GalindoSerrano and L. Giupponi, “Distributed QLearning for interference control in OFDMABased femtocell networks,” in 2010 IEEE 71st Vehicular Technology Conference, May 2010, pp. 1–5.
 [28] O. Naparstek and K. Cohen, “Deep multiuser reinforcement learning for distributed dynamic spectrum access,” IEEE Transactions on Wireless Communications, pp. 1–1, 2018.
 [29] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436 EP –, May 2015.
 [30] J. Hu and M. P. Wellman, “Online learning about other agents in a dynamic multiagent system,” in International Conference on Autonomous Agents: Proceedings of the second international conference on Autonomous agents, vol. 10, no. 13. Citeseer, 1998, pp. 239–246.
 [31] T. T. Nguyen, N. Duy Nguyen, and S. Nahavandi, “Deep reinforcement learning for multiagent systems: A review of challenges, solutions and applications,” arXiv eprints, p. arXiv:1812.11794, Dec 2018.
 [32] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, “Stabilising experience replay for deep multiagent reinforcement learning,” in Proceedings of the 34th International Conference on Machine LearningVolume 70. JMLR. org, 2017, pp. 1146–1155.
 [33] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” PloS one, vol. 12, no. 4, p. e0172395, 2017.
 [34] J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multiagent control using deep reinforcement learning,” in International Conference on Autonomous Agents and Multiagent Systems. Springer, 2017, pp. 66–83.
 [35] J. Watt, R. Borhani, and A. K. Katsaggelos, Machine learning refined: foundations, algorithms, and applications. Cambridge University Press, 2016.
 [36] “Requirements for further advancements for EUTRA (LTEAdvanced),” 3GPP TR 36.913 v.8.0.0, available at http://www.3gpp.org.
 [37] B. Zhuang, D. Guo, and M. L. Honig, “Energyefficient cell activation, user association, and spectrum allocation in heterogeneous networks,” IEEE Journal on Selected Areas in Communications, vol. 34, no. 4, pp. 823–831, April 2016.
 [38] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for largescale machine learning,” in 12th $\mathrm{\{}$USENIX$\mathrm{\}}$ Symposium on Operating Systems Design and Implementation ($\mathrm{\{}$OSDI$\mathrm{\}}$ 16), 2016, pp. 265–283.
 [39] S. Ruder, “An overview of gradient descent optimization algorithms,” CoRR, vol. abs/1609.04747, 2016. [Online]. Available: http://arxiv.org/abs/1609.04747
 [40] V. FrançoisLavet, R. Fonteneau, and D. Ernst, “How to discount deep reinforcement learning: Towards new dynamic strategies,” CoRR, vol. abs/1512.02011, 2015. [Online]. Available: http://arxiv.org/abs/1512.02011
 [41] “Study on channel model for frequencies from 0.5 to 100 ghz,” 3GPP TR 38.901 v.14.0.0, available at http://www.etsi.org.
 [42] F. Meng, P. Chen, and L. Wu, “Power allocation in multiuser cellular networks with deep Q learning approach,” arXiv eprints, p. arXiv:1812.02979, Dec. 2018.