Reinforcement Learning-Based Trajectory Design for the Aerial Base Stations

  • 2019-06-29 05:28:34
  • Behzad Khamidehi, Elvino S. Sousa


Behzad Khamidehi and Elvino S. Sousa
Department of Electrical and Computer Engineering, University of Toronto, Canada
Abstract

In this paper, the trajectory optimization problem for a multi-aerial base station (ABS) communication network is investigated. The objective is to find the trajectory of the ABSs so that the sum-rate of the users served by each ABS is maximized. To reach this goal, along with the optimal trajectory design, optimal power and sub-channel allocation is also of great importance to support the users with the highest possible data rates. To solve this complicated problem, we divide it into two sub-problems: the ABS trajectory optimization sub-problem, and the joint power and sub-channel assignment sub-problem. Then, based on the Q-learning method, we develop a distributed algorithm which solves these sub-problems efficiently and does not need a significant amount of information exchange between the ABSs and the core network. Simulation results show that although Q-learning is a model-free reinforcement learning technique, it has a remarkable capability to train the ABSs to optimize their trajectories based on the received reward signals, which carry useful information about the topology of the network.

Reinforcement learning, Q-learning, Aerial Base Station (ABS), Trajectory optimization.

I Introduction

Supporting the ceaselessly increasing number of mobile devices and their high data rate demands is among the most critical concerns of wireless networks. Unfortunately, the currently deployed base stations (BSs) cannot completely satisfy these requirements due to their static nature [1]. In particular, a long distance between mobile users and these static BSs degrades the quality of the communication link and leads to poor coverage. To resolve this issue, technologies that allow the BSs to adaptively decrease their distances to the users are essential for future communication networks. Motivated by this, aerial base stations (ABSs) have recently attracted significant attention due to their applications in wireless networks [2]. Given advantages such as high mobility, low cost, and flexible deployment [3], ABSs are a promising technology to improve the coverage of mobile users in wireless networks.

The research conducted on ABSs can be categorized into two lines: static ABSs and dynamic ABSs. For static ABSs, the objective is to find the optimal placement of the ABS(s) such that a criterion such as coverage or sum-rate of the users is maximized [4, 5, 6]. For instance, in [4] the optimal placement of the ABSs is determined to minimize the number of ABSs required to cover ground users, ensuring that each mobile user is in the coverage area of at least one of the ABSs. In [5] an algorithm is developed to optimize the 3-D deployment of an ABS with the objective of maximizing the number of covered users using the minimum transmit power. However, in all of these works, the location of the ABS is assumed to be fixed. For dynamic ABSs, in contrast, the main goal is to leverage the mobility of the ABSs. Thus, the trajectory of the ABS(s) is optimized to improve the performance of the mobile users [7, 8, 9]. This performance gain results from the fact that when the distance between an ABS and a user decreases, the probability of having a line-of-sight (LoS) link increases, and hence higher data rates are achievable than in the static case. In [7], based on the block coordinate descent and successive convex approximation techniques, the authors optimize the user association, ABS trajectories, and transmit power of a multi-ABS system with the objective of maximizing the minimum data rate of the users. In [8], the data offloading problem for an ABS-enabled wireless network is investigated, and the trajectory of the ABS is obtained with the objective of maximizing the sum-rate of the users served by the ABS, subject to a minimum data requirement for all users. In [9] the authors propose an algorithm to maximize the minimum data rate of ground users by jointly optimizing the ABS trajectory and its power and bandwidth. It is worth mentioning that all of these works assume that perfect knowledge of the environment is available. This assumption is not practical since the topology of the network can change continuously, and obtaining perfect knowledge of the environment would require a significant amount of information to be exchanged between the ABSs and the core network, which is not feasible.

Motivated by this, we propose a distributed algorithm based on Q-learning to optimize the trajectory of the ABSs. In contrast to the existing algorithms, our algorithm does not need perfect knowledge of the environment, and the amount of information exchanged between ABSs and the core network is negligible. The objective of our trajectory design algorithm is to maximize the sum-rate of users served by each ABS. Therefore, it is of great importance to allocate optimal power and sub-channels to the users to support them with the highest data rates. To solve this complicated problem, we divide it into two sub-problems: trajectory optimization sub-problem (higher level) and joint power and sub-channel allocation sub-problem (lower level). In the higher level, based on the reinforcement learning techniques, a Q-learning problem is formulated to train and update the trajectory of the ABSs using the feedback signal received from the environment. On the other hand, in the lower level, a joint power and sub-channel allocation problem is solved to form the reward function emerging from taking actions by the ABSs in the higher level.

The rest of the paper is organized as follows: Section II describes the system model, including the channel model. The principles of reinforcement learning are presented in Section III. Section IV formulates the trajectory design problem as a Q-learning problem and presents our algorithm. Simulation results and the conclusion are presented in Sections V and VI, respectively.

II System Model

In this paper, we consider the downlink of a wireless network integrated with multiple ABSs. The set of all ABSs is denoted by $\mathcal{J}=\{1,2,\dots,J\}$, and the set of users associated with the j-th ABS is denoted by $\mathcal{K}_j$. Moreover, the ABSs share a set $\mathcal{N}=\{1,2,\dots,N\}$ of orthogonal sub-channels. We use indices j, k, and n to represent ABSs, mobile users, and sub-channels, respectively. The position of the j-th ABS at time t is denoted by $\boldsymbol{\Omega}_j(t)=(x_j(t),y_j(t),H)$, where H is the altitude of the ABSs, which is assumed to be constant throughout their flight time. The initial and final positions of the j-th ABS are indicated by $\boldsymbol{\Omega}_j^0=(x_j^0,y_j^0,H)$ and $\boldsymbol{\Omega}_j^F=(x_j^F,y_j^F,H)$, respectively. According to this notation, the trajectory of the j-th ABS starts from $\boldsymbol{\Omega}_j^0$ and ends at $\boldsymbol{\Omega}_j^F$. We assume that the speed of the ABSs is constant during their flights, i.e., $V_j(t)=V,\ \forall t,j$, where V is the constant velocity. The distance between ABS j and its final position $\boldsymbol{\Omega}_j^F$ at time t is expressed as

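$$D_j(t)\triangleq\left\|\boldsymbol{\Omega}_j(t)-\boldsymbol{\Omega}_j^F\right\|_2,\quad \forall j,t, \tag{1}$$

and the distance between ABS $j_1$ and ABS $j_2$ at time t is

$$D_{j_1,j_2}(t)\triangleq\left\|\boldsymbol{\Omega}_{j_1}(t)-\boldsymbol{\Omega}_{j_2}(t)\right\|_2,\quad \forall j_1\neq j_2,\ t. \tag{2}$$

To avoid any collision between the ABSs, their trajectories are subject to the following collision avoidance constraint

$$D_{j_1,j_2}(t)>D_{\min},\quad \forall j_1\neq j_2,\ t, \tag{3}$$

where $D_{\min}$ is the minimum distance between any pair of ABSs that ensures collision avoidance.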
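The distance definitions (1) and (2) and the constraint (3) can be illustrated with a short Python sketch. The function names and the tuple representation of positions are assumptions made here for illustration and do not come from the paper.

```python
import math
from itertools import combinations


def distance_to_goal(position, goal):
    """Eq. (1): Euclidean distance between an ABS position and its final position."""
    return math.dist(position, goal)


def pairwise_distance(position_a, position_b):
    """Eq. (2): Euclidean distance between two ABSs."""
    return math.dist(position_a, position_b)


def collision_free(positions, d_min):
    """Eq. (3): True only if every pair of ABSs is strictly farther apart than d_min."""
    return all(pairwise_distance(p, q) > d_min
               for p, q in combinations(positions, 2))
```

For two ABSs at (0, 0, 100) and (3, 4, 100), the pairwise distance is 5, so the constraint holds for a threshold of 4.9 but not for a threshold of 5, because the inequality in (3) is strict.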

For the air-to-ground links between ABSs and mobile users, we assume that the channel gains are composed of path loss and frequency-selective Rayleigh fading. If $h_{jk}^n(t)$ denotes the channel between the j-th ABS and user k over the n-th sub-channel at time t, we have

$$h_{jk}^n(t)=\frac{\rho_{jk}^n(t)}{PL_{jk}(t)}, \tag{4}$$

where $\rho_{jk}^n(t)$ accounts for the small-scale fading, and $PL_{jk}(t)$ is the average path loss of the link at time t. For the path loss, we adopt the model proposed in [10]. According to this model, both LoS and Non-LoS propagation groups are involved in the average path loss expression. To consider the effects of LoS and Non-LoS links on the average path loss, we need to know their probabilities, which depend on the elevation angle between the ABS and the mobile users. Based on [10], the probability of having a LoS link between the j-th ABS and user k at time t is given by

$$pr_{jk}^{\text{LoS}}(t)=\frac{1}{1+a\exp\left(-b\left(\theta_{jk}(t)-a\right)\right)}, \tag{5}$$

where $\theta_{jk}(t)$ is the elevation angle between ABS j and user k at time t, and a and b are constants depending on the environment. The elevation angle is given as $\theta_{jk}(t)=\arctan\left(\frac{H}{L(t)}\right)$, where H and $L(t)$ are the altitude and the horizontal distance between the j-th ABS and user k at time t, respectively. Based on the probability given in (5), the average path loss between ABS j and user k, $PL_{jk}(t)$, can be formulated as

$$PL_{jk}(t)=pr_{jk}^{\text{LoS}}(t)\,PL^{\text{LoS}}(t)+\left(1-pr_{jk}^{\text{LoS}}(t)\right)PL^{\text{N-LoS}}(t), \tag{6}$$

where $PL^{\text{LoS}}(t)$ and $PL^{\text{N-LoS}}(t)$ are the free space path losses corresponding to the LoS and Non-LoS links, respectively. The free space path loss corresponding to the LoS link is expressed as $PL^{\text{LoS}}(t)=\left(\frac{4\pi f_c d_{jk}(t)}{c}\right)^2\times\eta^{\text{LoS}}$, where $d_{jk}(t)$ represents the distance between the j-th ABS and user k at time t, $f_c$ is the carrier frequency, c is the speed of light, and $\eta^{\text{LoS}}$ is the environment-dependent additional loss due to the LoS link. In a similar way, the free space path loss corresponding to the Non-LoS link is described as $PL^{\text{N-LoS}}(t)=\left(\frac{4\pi f_c d_{jk}(t)}{c}\right)^2\times\eta^{\text{N-LoS}}$, where $\eta^{\text{N-LoS}}$ is the additional loss due to the Non-LoS link. For the small-scale fading, without loss of generality we assume that for all j, k, n, $\rho_{jk}^n(t)$ are independent and identically distributed (i.i.d.) random variables with $\mathbb{E}\{|\rho_{jk}^n|^2\}=1$. Using the described model, the channel gain between the j-th ABS and user k at time t is given by

$$g_{jk}^n(t)\triangleq\left|h_{jk}^n(t)\right|^2. \tag{7}$$
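As a rough illustration of (5) and (6), the following Python sketch evaluates the LoS probability and the average path loss in linear scale. It assumes that the elevation angle is measured in degrees, which the text does not state explicitly; the function and parameter names are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def los_probability(altitude, horizontal_distance, a, b):
    """Eq. (5). Assumption: the elevation angle is expressed in degrees."""
    theta = math.degrees(math.atan2(altitude, horizontal_distance))
    return 1.0 / (1.0 + a * math.exp(-b * (theta - a)))


def average_path_loss(altitude, horizontal_distance, fc, a, b, eta_los, eta_nlos):
    """Eq. (6): probability-weighted mix of the LoS and Non-LoS path losses (linear scale)."""
    d = math.hypot(altitude, horizontal_distance)   # ABS-to-user distance d_jk
    free_space = (4.0 * math.pi * fc * d / C) ** 2  # common free space term
    p_los = los_probability(altitude, horizontal_distance, a, b)
    return p_los * free_space * eta_los + (1.0 - p_los) * free_space * eta_nlos
```

Under these assumptions, a user directly below the ABS sees a LoS probability very close to one, and the average path loss grows as the user moves away horizontally, both because the distance increases and because the Non-LoS term receives more weight.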

As discussed earlier, in this paper we aim to find trajectories of the ABSs that maximize the sum-rate of the users associated with each ABS. As a result, in addition to the optimal trajectory design, optimal power and sub-channel allocation is also necessary to support the users with the highest data rates. If the transmit power of the j-th ABS on the n-th sub-channel at time t is denoted by $P_j^n(t)$, then according to the Shannon formula, the instantaneous rate of user k served by the j-th ABS at time t on sub-channel n is given by

$$r_{jk}^n(t)=\log\left(1+\frac{P_j^n(t)\,g_{jk}^n(t)}{I_{jk}^n(t)+\sigma^2}\right), \tag{8}$$

where $\sigma^2$ is the thermal noise at the receiver of the k-th mobile user and $I_{jk}^n(t)$ is the total interference on the n-th sub-channel at time t arising from the other ABSs and the ground base station (GBS) at the receiver of the k-th user. The interference $I_{jk}^n(t)$ is expressed as

$$I_{jk}^n(t)=\sum_{j'=1,\,j'\neq j}^{J}P_{j'}^n(t)\,g_{j'k}^n(t)+P_0^n(t)\,g_{0k}^n(t), \tag{9}$$

where the summation is the total interference from the other ABSs and the last term accounts for the interference arising from the GBS. Accordingly, $P_0^n(t)$ is the transmit power of the GBS at time t over the n-th sub-channel, and $g_{0k}^n(t)$ is the channel gain from the GBS to user k over sub-channel n at time t. Moreover, we define the sub-channel assignment indicator as

$$a_{jk}^n(t)\in\{0,1\},\quad \forall j,k,n,t, \tag{10}$$

where $a_{jk}^n(t)=1$ indicates that at time t, sub-channel n is assigned to user k by its serving ABS j. Otherwise, the value of $a_{jk}^n(t)$ is zero.
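The rate expression (8) and the interference term (9) can be sketched in Python as follows. The nested-list layout (`power[j][n]`, `gain[j][k][n]`, `gbs_power[n]`, `gbs_gain[k][n]`) and the use of a base-2 logarithm are assumptions made for this illustration only.

```python
import math


def interference(j, k, n, power, gain, gbs_power, gbs_gain):
    """Eq. (9): interference at user k on sub-channel n from all ABSs except j, plus the GBS."""
    from_other_abs = sum(power[jp][n] * gain[jp][k][n]
                         for jp in range(len(power)) if jp != j)
    return from_other_abs + gbs_power[n] * gbs_gain[k][n]


def rate(j, k, n, power, gain, gbs_power, gbs_gain, noise):
    """Eq. (8): instantaneous rate of user k served by ABS j on sub-channel n (log base 2 assumed)."""
    sinr = power[j][n] * gain[j][k][n] / (
        interference(j, k, n, power, gain, gbs_power, gbs_gain) + noise)
    return math.log2(1.0 + sinr)
```

With two ABSs transmitting 2 and 1 units of power over gains 1 and 0.5, a silent GBS, and noise 0.5, the interference at the user is 0.5, the signal-to-interference-plus-noise ratio is 2, and the rate is log2(3).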

III Reinforcement Learning Fundamentals

In this section, we briefly explain the principles of reinforcement learning. Then, we formulate our trajectory design problem as a Q-learning problem which can be solved efficiently.

In reinforcement learning, the agent interacts with the environment at each of a sequence of discrete time steps denoted by $t=0,1,2,\dots$. If $s_t$ and $\mathcal{S}$ denote the state occupied by the agent at time t and the set of all possible states, respectively, then based on the current state $s_t$, the agent takes an action $a_t\in\mathcal{A}(s_t)$, where $\mathcal{A}(s_t)$ is the set of actions available at state $s_t$. One time step later, the agent receives reward $r_{t+1}$ and moves to a new state $s_{t+1}$. Fig. 1 presents the interaction between agent and environment.

Fig. 1: Interaction between agent and environment.

At each time step, the agent implements a policy $\pi(a|s)$, which is the probability that $a_t=a$ if $s_t=s$, i.e.,

$$\pi(a|s)=\Pr(a_t=a\mid s_t=s). \tag{11}$$

The implemented policy is modified over time based on previous experience. Reinforcement learning describes how this policy changes as a result of the agent's experience [11]. The goal of an agent is not to maximize its immediate reward, but to maximize the expected sum of discounted rewards it receives over the long run. For this purpose, we define a return function $R_t$ as

$$R_t\triangleq\sum_{k=0}^{T-1}\gamma^k r_{t+k+1}, \tag{12}$$

where T is the final time step and $\gamma$ is the discount rate, $0\le\gamma<1$. Based on (11) and (12), we can define a function which represents the value of taking action a in state s under policy $\pi$, denoted by $Q^{\pi}(s,a)$, as

$$Q^{\pi}(s,a)=\mathbb{E}_{\pi}\{R_t\mid s_t=s,a_t=a\}. \tag{13}$$

In other words, $Q^{\pi}(s,a)$ is the expected return starting from s, taking action a, and following policy $\pi$. This function is called the action-value function or simply the Q-function. Using the definition of return in (12), the action-value function can be decomposed into the immediate reward plus the discounted Q-function of the successor state as

$$Q^{\pi}(s,a)=\mathbb{E}_{\pi}\{r_{t+1}+\gamma Q^{\pi}(s_{t+1},a_{t+1})\mid s_t=s,a_t=a\}. \tag{14}$$

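The optimal Q-function is expressed as

$$Q^{*}(s,a)=\max_{\pi}Q^{\pi}(s,a), \tag{15}$$

and the optimal policy can be obtained by maximizing over $Q^{*}(s,a)$ according to

$$\pi^{*}(a|s)=\begin{cases}1 & \text{if } a=\arg\max_{a'}Q^{*}(s,a')\\ 0 & \text{otherwise.}\end{cases} \tag{16}$$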

Moreover, the optimal Q-function has to satisfy the Bellman optimality equation [11] for the Q-function as

$$Q^{*}(s,a)=\mathbb{E}_{\pi}\{r_{t+1}+\gamma\max_{a'}Q^{*}(s_{t+1},a')\mid s_t=s,a_t=a\}. \tag{17}$$

It is worth mentioning here that the Bellman optimality equation is non-linear and in general does not have a closed-form solution. Instead, to solve (17) we have to use an iterative method. Q-learning is a well-known method for finding $Q^{*}(s,a)$ in a recursive manner [12]. According to this method, the learned action-value function is updated as

$$Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha\left[r_{t+1}+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\right], \tag{18}$$

where $\alpha\in[0,1]$ is the learning rate. The most important advantage of this method is that the learned Q-function derived by (18) directly approximates the optimal action-value function, $Q^{*}(s,a)$, independent of the policy followed by the agent [11].
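A minimal Python sketch of the return in (12) and a single application of the update rule (18) is given below. The table layout `q[state][action]` and the function names are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Eq. (12): rewards[k] plays the role of r_{t+k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def q_update(q, s, a, r, s_next, alpha, gamma):
    """Eq. (18): one tabular Q-learning step on q[state][action]; returns the new value."""
    td_target = r + gamma * max(q[s_next])   # bootstrapped target
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]
```

For example, with `q = {0: [0.0, 0.0], 1: [1.0, 2.0]}`, a reward of 1, a learning rate of 0.5, and a discount rate of 0.9, updating action 1 in state 0 after a transition to state 1 gives a target of 1 + 0.9 × 2 = 2.8 and a new value of 1.4.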

IV Trajectory optimization as a Q-learning problem

In this section, the trajectory optimization is formulated as a Q-learning problem. To find the trajectories of the ABSs with the objective of maximizing sum-rate of the users associated with each ABS, we divide the problem into two sub-problems: trajectory optimization sub-problem and joint power and sub-channel allocation sub-problem. In other words, a two stage decision making procedure is considered to deal with the original complicated problem. In the first stage, using the Q-learning technique, ABSs take one of the available actions to obtain their trajectories. Upon any change in the position of the ABSs, in the second stage, a new optimization problem is solved to allocate optimal power and sub-channels to the mobile users. The corresponding sum-rate contributes to the reward received from taking the action in the first stage. In what follows, we define the agents, states, actions, and reward associated to the Q-learning framework:
Agent j: ABS j, $1\le j\le J$.
State $s_t^{(j)}$: Position of ABS j at time t. In other words, $s_t^{(j)}=\{x_j(t),y_j(t),H\}$. It is worth mentioning that in general, the position of each ABS can be a continuous function. To restrict the number of possible states, we place a grid with a limited number of squares over the area to discretize it. The center of each square represents that state. For instance, if the area of interest is rectangular and is described by $x_{\min}\le x\le x_{\max}$ and $y_{\min}\le y\le y_{\max}$, we convert this continuous area into $M^2$ states by splitting both the $[x_{\min},x_{\max}]$ and $[y_{\min},y_{\max}]$ intervals into M slots. In the resulting grid, the coordinates of the center of the square located at the $k_1$-th slot on the x axis and the $k_2$-th slot on the y axis are $x_{k_1}=x_{\min}+\frac{x_{\max}-x_{\min}}{M}(k_1-1)$ and $y_{k_2}=y_{\min}+\frac{y_{\max}-y_{\min}}{M}(k_2-1)$, respectively. As a result, the number of available states in our problem is limited to $M^2$.
Action $a_t^{(j)}$: Available actions at each state are the movements in four directions: left, right, forward, and backward.
Reward $r_t^{(j)}$: The reward function for the j-th agent is defined as follows

$$r_t^{(j)}=\beta_1F_1^{(j)}(t)-\beta_2F_2^{(j)}(t)-\beta_3F_3^{(j)}(t), \tag{19}$$

where $F_1^{(j)}$ reflects the sum-rate of the users served by ABS j, $F_2^{(j)}$ motivates the ABS to complete its flight in a reasonably short time, and $F_3^{(j)}$ is an activation function required for the safety of the ABSs. In (19), $\beta_1$, $\beta_2$, and $\beta_3$ are parameters of the reward function. In what follows we introduce functions $F_1^{(j)}$, $F_2^{(j)}$, and $F_3^{(j)}$ in more detail and explain the rationale behind this selection.
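The grid-based state and action definitions above can be sketched in Python as follows. The slot coordinate follows the expression given for the state definition; the integer state identifier, the mapping of forward and backward to the y axis, and the clipping of moves at the border of the area are assumptions made here, since the text does not specify them.

```python
# Assumed mapping of the four actions to slot increments (dk1, dk2).
ACTIONS = {"left": (-1, 0), "right": (1, 0), "forward": (0, 1), "backward": (0, -1)}


def slot_coordinate(k, lower, upper, m):
    """Coordinate assigned to the k-th slot (1-based): lower + (upper - lower) / m * (k - 1)."""
    if not 1 <= k <= m:
        raise ValueError("slot index out of range")
    return lower + (upper - lower) / m * (k - 1)


def state_index(k1, k2, m):
    """Hypothetical flattening of slot indices (k1, k2) into one of the m * m states."""
    return (k1 - 1) * m + (k2 - 1)


def move(k1, k2, action, m):
    """Apply an action and keep the ABS inside the grid (clipping is an assumption)."""
    dk1, dk2 = ACTIONS[action]
    return (min(max(k1 + dk1, 1), m), min(max(k2 + dk2, 1), m))
```

With 30 slots per axis over the interval [0, 3000], there are 900 states, and the last slot index maps to state 899 under this flattening.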

As discussed earlier, Q-learning aims to maximize the sum-rate of the users while the time to complete the flight is reasonably short. Moreover, for safety purposes, the distance between any pair of ABSs must be greater than some predefined threshold. Based on these concerns, we design the reward function. Function $F_1^{(j)}(t)$ is defined to be the sum-rate of mobile users currently served by ABS j. This function is expressed as

$$F_1^{(j)}(t)=\sum_{k=1}^{K}\sum_{n=1}^{N}a_{jk}^n(t)\log\left(1+\frac{P_j^n(t)\,g_{jk}^n(t)}{I_{jk}^n(t)+\sigma^2}\right), \tag{20}$$

where $P_j^n(t)$ and $a_{jk}^n(t)$ are the optimal power and sub-channel allocation at time t, respectively. To find these values, the j-th ABS solves the following optimization problem at time t

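$$\max_{\mathbf{P},\mathbf{a}}\ \sum_{k=1}^{K}\sum_{n=1}^{N}a_{jk}^n(t)\log\left(1+\frac{P_j^n(t)\,g_{jk}^n(t)}{I_{jk}^n(t)+\sigma^2}\right) \tag{21a}$$
$$\text{s.t.}\ \sum_{k=1}^{K}\sum_{n=1}^{N}a_{jk}^n(t)P_j^n(t)\le P_{\max}, \tag{21b}$$
$$\sum_{k=1}^{K}a_{jk}^n(t)\le 1,\quad \forall n, \tag{21c}$$
$$a_{jk}^n(t)\in\{0,1\},\quad \forall k,n, \tag{21d}$$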

where constraint (21b) ensures that the transmit power of each ABS is limited to $P_{\max}$ and (21c) guarantees that at each time t, each sub-channel is assigned to at most one mobile user. Problem (21) is a non-convex mixed-integer problem due to constraint (21d). To make it tractable, we first relax the binary variables $a_{jk}^n(t)$ to be continuous in the interval $[0,1]$. Moreover, we denote the actual power allocated to the k-th mobile user served by the j-th ABS on the n-th sub-channel at time t by $\tilde{P}_{jk}^n(t)\triangleq a_{jk}^n(t)P_j^n(t)$. As a result, the new optimization problem is given by

$$\max_{\tilde{\mathbf{P}},\mathbf{a}}\ \sum_{k=1}^{K}\sum_{n=1}^{N}a_{jk}^n(t)\log\left(1+\frac{\tilde{P}_{jk}^n(t)\,g_{jk}^n(t)}{a_{jk}^n(t)\left(I_{jk}^n(t)+\sigma^2\right)}\right) \tag{22a}$$
$$\text{s.t.}\ \sum_{k=1}^{K}\sum_{n=1}^{N}\tilde{P}_{jk}^n(t)\le P_{\max}, \tag{22b}$$
$$0\le a_{jk}^n(t)\le 1,\quad \forall k,n, \tag{22c}$$
(21c).

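It can be shown that the function $f(x,y)=x\log\left(1+\frac{y}{xh}\right)$, for a constant h, is concave over $x\ge 0$ and $y\ge 0$ since its Hessian matrix is negative semi-definite. Therefore, problem (22) is a convex problem. To solve (22), we form its Lagrangian function as follows

$$\mathcal{L}(\mathbf{P},\mathbf{a},\lambda,\boldsymbol{\mu})=\sum_{k=1}^{K}\sum_{n=1}^{N}a_{jk}^n(t)\log\left(1+\frac{P_j^n(t)\,g_{jk}^n(t)}{I_{jk}^n(t)+\sigma^2}\right)+\lambda\left(P_{\max}-\sum_{k=1}^{K}\sum_{n=1}^{N}a_{jk}^n(t)P_j^n(t)\right)+\sum_{n=1}^{N}\mu_n\left(1-\sum_{k=1}^{K}a_{jk}^n(t)\right), \tag{23}$$

where $\lambda$ and $\mu_n$ are the dual multipliers associated with constraints (21b) and (21c), respectively. The Lagrangian dual function is defined as

$$g(\lambda,\boldsymbol{\mu})=\max_{\tilde{\mathbf{P}},\mathbf{a}}\mathcal{L}(\mathbf{P},\mathbf{a},\lambda,\boldsymbol{\mu}),$$

and the dual problem is

$$\min_{\lambda,\boldsymbol{\mu}}\ g(\lambda,\boldsymbol{\mu})\quad \text{s.t. } \lambda,\boldsymbol{\mu}\ge 0. \tag{24}$$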

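According to the KKT conditions [13], the optimal solution of (22) must satisfy $\frac{\partial\mathcal{L}}{\partial\tilde{P}_{jk}^n}=0$ and $\frac{\partial\mathcal{L}}{\partial a_{jk}^n}=0$. As a result, we obtain

$$P_j^n(t)=\frac{\tilde{P}_{jk}^n(t)}{a_{jk}^n(t)}=\left[\frac{1}{\ln 2\,\lambda}-\frac{I_{jk}^n(t)+\sigma^2}{g_{jk}^n(t)}\right]^{+}, \tag{25}$$

and

$$a_{jk}^n(t)=1\,\Big|_{\,k=\arg\max_{k'}\Psi_{jk'}^n(t)}, \tag{26}$$

where

$$\Psi_{jk}^n(t)=\log\left(1+\frac{P_j^n(t)\,g_{jk}^n(t)}{I_{jk}^n(t)+\sigma^2}\right)-\frac{1}{\ln 2}\,\frac{P_j^n(t)\,g_{jk}^n(t)}{P_j^n(t)\,g_{jk}^n(t)+I_{jk}^n(t)+\sigma^2}.$$

In other words, sub-channel n is assigned to the user k with the largest value of $\Psi_{jk}^n(t)$. The dual variable $\lambda$ can be updated with the sub-gradient method according to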

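$$\lambda^{(l+1)}=\left[\lambda^{(l)}-\alpha^{(l)}\left(P_{\max}-\sum_{k=1}^{K}\sum_{n=1}^{N}\tilde{P}_{jk}^n(t)\right)\right]^{+}, \tag{27}$$

where $\alpha^{(l)}$ is the step size in the l-th iteration, which has to satisfy $\sum_{l=1}^{\infty}\alpha^{(l)}=\infty$ and $\lim_{l\to\infty}\alpha^{(l)}=0$. Using (25) and (26), the sum-rate of the users associated with the j-th ABS is derived by (20).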
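The lower-level allocation in (25) and (26) can be illustrated with the following Python sketch. Note that, for simplicity and determinism, it replaces the sub-gradient update (27) with a bisection search on $\lambda$, which is a swapped-in technique and not the update described in the text. The input layout `inv_gain[k][n]`, standing for $(I_{jk}^n+\sigma^2)/g_{jk}^n$ and assumed strictly positive, the bracket for $\lambda$, and the function name are assumptions.

```python
import math


def allocate_power_and_subchannels(inv_gain, p_max, iterations=100):
    """Return (assignment, powers): the chosen user and power for every sub-channel."""
    num_users, num_subchannels = len(inv_gain), len(inv_gain[0])

    def solve_for(lam):
        level = 1.0 / (math.log(2.0) * lam)  # water level 1 / (ln 2 * lambda)
        assignment, powers = [], []
        for n in range(num_subchannels):
            best_user, best_psi, best_power = 0, -1.0, 0.0
            for k in range(num_users):
                p = max(0.0, level - inv_gain[k][n])          # eq. (25)
                x = p / inv_gain[k][n]                        # P g / (I + sigma^2)
                psi = math.log2(1.0 + x) - x / ((1.0 + x) * math.log(2.0))
                if psi > best_psi:                            # eq. (26)
                    best_user, best_psi, best_power = k, psi, p
            assignment.append(best_user)
            powers.append(best_power)
        return assignment, powers

    low, high = 1e-12, 1e12  # assumed bracket for lambda
    for _ in range(iterations):
        lam = math.sqrt(low * high)  # bisection in the log domain
        _, powers = solve_for(lam)
        if sum(powers) > p_max:
            low = lam    # power budget exceeded: increase the price lambda
        else:
            high = lam   # budget respected: try a smaller lambda
    return solve_for(high)
```

As a usage example, with `inv_gain = [[1.0, 4.0], [2.0, 1.0]]` and `p_max = 2.0`, user 0 has the better channel on the first sub-channel and user 1 on the second, so each receives its preferred sub-channel, and the water level settles at 2 with one unit of power on each sub-channel.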

Algorithm 1: Distributed Q-Learning for the ABS trajectory design
 1: For all ABSs, initialize their Q-function $Q_j(s,a)$ for all $s\in\mathcal{S}$, $a\in\mathcal{A}(s)$, and $j\in\mathcal{J}$, arbitrarily.
 2: Set $Q_j(\text{terminal state},\cdot)=0$ for all ABSs.
 3: for episode = 1 to max episode do
 4:   Initialize $s_0^{(j)}$ based on the initial location of each ABS j.
 5:   for each step of episode (time t) do
 6:     for each ABS do
 7:       if rand(.) < $\epsilon$ then
 8:         Select action randomly (exploration)
 9:       else
10:         Choose action $a_t^{(j)}=\arg\max_{a}Q_j(s_t^{(j)},a)$ (exploitation)
11:       end if
12:       Take action $a_t^{(j)}$
13:       Receive the immediate reward $r_{t+1}^{(j)}$ according to (19)
14:       Observe the new state $s_{t+1}^{(j)}$
15:       Update Q-function for ABS j as
16:       $Q_j(s_t^{(j)},a_t^{(j)})\leftarrow Q_j(s_t^{(j)},a_t^{(j)})+\alpha\left[r_{t+1}^{(j)}+\gamma\max_{a}Q_j(s_{t+1}^{(j)},a)-Q_j(s_t^{(j)},a_t^{(j)})\right]$
17:     end for
18:   end for
19: end for
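The loop structure of the distributed algorithm can be sketched in Python as below. This is a toy version under several assumptions: states are integer grid cells `(x, y)` with `0 <= x, y < m`, moves are clipped at the border, Q-tables are initialized to zero, and the reward is supplied by a caller-defined `reward_fn(j, states)` that stands in for (19). All names are hypothetical.

```python
import random

ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # left, right, forward, backward


def train(starts, goals, m, reward_fn, episodes=200, max_steps=100,
          alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Independent tabular Q-learning, one Q-table per ABS, with epsilon-greedy actions."""
    rng = random.Random(seed)
    n_agents = len(starts)
    q = [{(x, y): [0.0] * len(ACTIONS) for x in range(m) for y in range(m)}
         for _ in range(n_agents)]
    for _ in range(episodes):
        states = list(starts)  # every episode restarts from the initial locations
        for _ in range(max_steps):
            if all(s == g for s, g in zip(states, goals)):
                break
            for j in range(n_agents):
                s = states[j]
                if s == goals[j]:
                    continue  # terminal state: its Q-values are never updated
                if rng.random() < eps:
                    a = rng.randrange(len(ACTIONS))  # exploration
                else:
                    a = max(range(len(ACTIONS)), key=lambda i: q[j][s][i])  # exploitation
                dx, dy = ACTIONS[a]
                s_next = (min(max(s[0] + dx, 0), m - 1),
                          min(max(s[1] + dy, 0), m - 1))
                states[j] = s_next
                r = reward_fn(j, states)
                q[j][s][a] += alpha * (r + gamma * max(q[j][s_next]) - q[j][s][a])
    return q
```

Each ABS only reads and writes its own table, which mirrors the distributed nature of the algorithm; the coupling between ABSs enters solely through the reward function.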

In addition to the sum-rate function $F_1^{(j)}(t)$, we need another function $F_2^{(j)}(t)$ to motivate the ABS to reach the terminal state (final position) by the end of its flight time. This prevents the ABSs from staying at a specific point and forces them to reach their destinations. We define

$$F_2^{(j)}(t)=D_j(t), \tag{28}$$

where $D_j(t)$ is the distance between ABS j and its final position at time t, as defined in (1). Moreover, we need another function $F_3^{(j)}(t)$ that acts as an activation function, returning a value when the distance between ABS j and any other ABS is less than the threshold distance required for collision avoidance. In other words,

$$F_3^{(j)}(t)=\begin{cases}1 & D_{j,j'}(t)<D_{\min},\ j'\neq j\\ 0 & \text{otherwise.}\end{cases} \tag{29}$$

Based on (20), (28), and (29), the reward function is expressed as (19). Algorithm 1 shows our distributed Q-learning algorithm for the ABS trajectory design.
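Putting (19), (28), and (29) together, the reward of one ABS can be sketched in Python as follows. The sum-rate $F_1^{(j)}(t)$ is passed in as a number, assuming it has already been obtained from the lower-level allocation; the function signature is hypothetical.

```python
import math


def reward(j, positions, goals, sum_rate, d_min, betas):
    """Eq. (19) for ABS j, given F1 = sum_rate and betas = (beta1, beta2, beta3)."""
    beta1, beta2, beta3 = betas
    f2 = math.dist(positions[j], goals[j])  # eq. (28): distance to the final position
    too_close = any(math.dist(positions[j], p) < d_min
                    for i, p in enumerate(positions) if i != j)
    f3 = 1.0 if too_close else 0.0          # eq. (29): collision indicator
    return beta1 * sum_rate - beta2 * f2 - beta3 * f3
```

With $(\beta_1,\beta_2,\beta_3)=(10,0.25,1000)$, a sum-rate of 2, and a distance of 40 to the destination, the reward is 10 when no other ABS is closer than the threshold and drops to -990 when one is, which shows how strongly the third term penalizes unsafe positions.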

V Simulation Results

In this section, numerical results are presented to show the performance of the proposed algorithm. We consider a 3 km × 3 km area, which is divided by a rectangular grid into $M^2=900$ states. Users are randomly distributed in the area. In our simulations we assume a dual-ABS system. The number of sub-channels and the carrier frequency are $N=8$ and $f_c=2$ GHz, respectively. The total number of users is 20, half of which are assigned to each of the ABSs. The altitude of the ABSs is $H=100$ m. Other parameters are as follows: $(a,b)=(5,0.5)$, $\eta^{\text{LoS}}=1$, $\eta^{\text{N-LoS}}=20$, $\gamma=0.9$, $\epsilon=0.1$, $(\beta_1,\beta_2,\beta_3)=(10,0.25,1000)$, $V=10$ m/s, $P_{\max}=200$ mW, $D_{\min}=5$ m, $[x_{\min},x_{\max}]=[0,3000]$, and $[y_{\min},y_{\max}]=[0,3000]$.

Fig. 2 presents the final trajectory of the ABSs. This figure shows that although Q-learning is a model-free method, it is able to make the agents aware of the environment. In other words, using the feedback signals, which are the reward signals, the ABSs can be trained to find their trajectories based on the topology of the network. This figure also shows that in addition to moving from the initial position toward the final position, the ABSs repeatedly decrease their distances to their associated users. This is essential for the ABSs since reducing the distance to a user improves the link quality between the ABS and that user. As a result, the data rate of the user increases. Moreover, it is observed that the ABSs eventually reach their final positions, either to prepare for the next flight or to recharge their batteries. Throughout the flight, the distance between the ABSs remains greater than $D_{\min}$.

Fig. 3 shows the convergence of the average sum-rate of the users. Index e indicates the episode number. As can be observed, at the beginning of the learning process, the achievable sum-rate fluctuates widely. However, after a sufficient number of episodes, this fluctuation becomes negligible and the sum-rate converges to its optimal value. This is due to the fact that the agents (ABSs) are trained based on the feedback signals they receive. These feedback signals carry the required information about the environment. Therefore, after a sufficient number of episodes, the ABSs have adequate knowledge of the topology of the network. This improves the performance of the ABSs in terms of their trajectories and their achievable data rates.

Fig. 2: Final trajectory of the ABSs.
Fig. 3: Convergence of the average sum-rate of the users

VI Conclusion

In this paper, we studied the trajectory optimization problem for a wireless network integrated with multiple ABSs. The objective is to maximize the total data rate of users served by each ABS. As a result, in addition to the trajectory optimization, it is of great importance to allocate optimal power and sub-channels to the users to support them with the highest possible data rates. To address both, we divide the problem into two sub-problems: the trajectory optimization sub-problem and the joint power and sub-channel allocation sub-problem. For the trajectory optimization sub-problem, a reinforcement learning problem has been formulated, while for the latter an optimization problem has been solved. The simulation results show that although Q-learning is a model-free reinforcement learning method, it effectively trains the ABSs to optimize their trajectories.

References

  • [1] I. Bor-Yaliniz and H. Yanikomeroglu, “The New Frontier in RAN Heterogeneity: Multi-Tier Drone-Cells,” IEEE Commun. Mag., vol. 54, no. 11, pp. 48–55, Nov. 2016.
  • [2] Y. Zeng, J. Lyu, and R. Zhang, “Cellular-Connected UAV: Potential, Challenges, and Promising Technologies,” IEEE Wireless Commun., vol. 26, no. 1, pp. 120–127, Feb. 2019.
  • [3] Y. Zeng, R. Zhang, and T. J. Lim, “Wireless Communications with Unmanned Aerial Vehicles: Opportunities and Challenges,” IEEE Commun. Mag., vol. 54, no. 5, pp. 36–42, May 2016.
  • [4] J. Lyu, Y. Zeng, R. Zhang, and T. J. Lim, “Placement Optimization of UAV-Mounted Mobile Base Stations,” IEEE Commun. Lett., vol. 21, no. 3, pp. 604–607, Mar. 2017.
  • [5] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, “3-D Placement of an Unmanned Aerial Vehicle Base Station (UAV-BS) for Energy-Efficient Maximal Coverage,” IEEE Wireless Commun. Lett., vol. 6, no. 4, pp. 434–437, Aug. 2017.
  • [6] R. Ghanavi, E. Kalantari, M. Sabbaghian, H. Yanikomeroglu, and A. Yongacoglu, “Efficient 3D aerial base station placement considering users mobility by reinforcement learning,” in Proc. IEEE WCNC, Apr. 2018, pp. 1–6.
  • [7] Q. Wu, Y. Zeng, and R. Zhang, “Joint Trajectory and Communication Design for Multi-UAV Enabled Wireless Networks,” IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 2109–2121, Mar. 2018.
  • [8] F. Cheng, S. Zhang, Z. Li, Y. Chen, N. Zhao, F. R. Yu, and V. C. M. Leung, “UAV Trajectory Optimization for Data Offloading at the Edge of Multiple Cells,” IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6732–6736, Jul. 2018.
  • [9] Q. Wu and R. Zhang, “Common Throughput Maximization in UAV-Enabled OFDMA Systems with Delay Consideration,” 2018. [Online]. Available: http://arxiv.org/abs/1801.00444
  • [10] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP Altitude for Maximum Coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, Dec. 2014.
  • [11] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed.   Cambridge, MA, USA: MIT Press, 1998.
  • [12] C. J. C. H. Watkins and P. Dayan, “Q-Learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992.
  • [13] S. Boyd and L. Vandenberghe, Convex Optimization.   Cambridge University Press, 2004.