A Communication-Efficient Multi-Agent Actor-Critic Algorithm for Distributed Reinforcement Learning

  • 2019-07-06 00:20:50
  • Yixuan Lin, Kaiqing Zhang, Zhuoran Yang, Zhaoran Wang, Tamer Başar, Romeil Sandhu, Ji Liu
  • 2

Abstract

This paper considers a distributed reinforcement learning problem in which anetwork of multiple agents aim to cooperatively maximize the globally averagedreturn through communication with only local neighbors. A randomizedcommunication-efficient multi-agent actor-critic algorithm is proposed forpossibly unidirectional communication relationships depicted by a directedgraph. It is shown that the algorithm can solve the problem for stronglyconnected graphs by allowing each agent to transmit only two scalar-valuedvariables at one time.

 

Quick Read (beta)

A Communication-Efficient Multi-Agent Actor-Critic Algorithm for Distributed Reinforcement Learning

Yixuan Lin, Kaiqing Zhang, Zhuoran Yang, Zhaoran Wang, Tamer Başar,
Romeil Sandhu, and Ji Liu
Y. Lin is with the Department of Applied Mathematics and Statistics at Stony Brook University ([email protected]). K. Zhang and T. Başar are with the Coordinated Science Laboratory at University of Illinois at Urbana-Champaign ({kzhang66,basar1}@illinois.edu), and their research was supported in part by the US Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196. Z. Yang is with the Department of Operations Research and Financial Engineering at Princeton University ([email protected]). Z. Wang is with the Department of Industrial Engineering and Management Sciences at Northwestern University ([email protected]). R. Sandhu is with the Departments of Bioinformatics and Computer Science at Stony Brook University ([email protected]). J. Liu is with the Department of Electrical and Computer Engineering at Stony Brook University ([email protected]).
Abstract

This paper considers a distributed reinforcement learning problem in which a network of multiple agents aim to cooperatively maximize the globally averaged return through communication with only local neighbors. A randomized communication-efficient multi-agent actor-critic algorithm is proposed for possibly unidirectional communication relationships depicted by a directed graph. It is shown that the algorithm can solve the problem for strongly connected graphs by allowing each agent to transmit only two scalar-valued variables at one time.

I. Introduction

Recently, there has been increasing interest in developing distributed machine learning algorithms. Notable examples include distributed linear regression [1], multi-arm bandit [2], reinforcement learning (RL) [3], and deep learning [4]. Such algorithms have promising applications in large-scale networks, such as social platforms, online economic networks, cyber-physical systems, and Internet of Things, primarily because in such a complex network, it is impossible to collect all the information at the same point and each component of the network may not be willing to share its private information due to privacy issues.

Multi-agent reinforcement learning (MARL) problems have recently received increasing attention. In general, MARL problems are investigated in settings that are either collaborative, competitive, or a mixture of the two. For collaborative MARL, the most rudimentary framework is the canonical multi-agent Markov decision process [5, 6], where the agents share a common reward function that is determined by the joint actions of all agents. Another notable framework for collaborative MARL is the team Markov game model, also with a shared reward function among agents [7, 8]. These two frameworks were then extended to the setting where agents are allowed to have heterogeneous reward functions [9, 3, 10, 11, 12], collaborating with the goal of maximizing the long-term return corresponding to the team averaged reward. In particular, these works focused on a fully-decentralized/distributed setting, where there exists no central controller to coordinate the agents to achieve the overall team goal. Instead, the agents are connected via a communication network and are only able to exchange information with the neighbors on the network. There is also an ever-growing number of works on MARL in competitive and mixed settings [13, 14, 15, 16], where most of the recent ones are empirical works without theoretical convergence guarantees. In this work, we focus on the decentralized/distributed and collaborative MARL setting with networked agents as in [9, 3, 17].

Within this setting, the work of [3] proposed the first fully distributed actor-critic algorithm. The algorithm allows the agents to exchange information over a communication network with possibly sparse connectivity at each agent, which improves the scalability of the multi-agent model with a high population of agents and thus tackles one of the long-standing challenges in general MARL problems [18]. A detailed comparison of the problem setting to the other existing ones on multi-agent and collaborative MARL is provided in [3].

A possible communication issue of the algorithms proposed in [3] is that they require each agent to transmit the entire vector of its estimate of ω to its neighbors at each time. Such communication-costly algorithms may not be possible to secure in some learning applications, for example, when the size of ω is very large but each agent has limited communication capacity. In this paper, we propose an approach in which each agent broadcasts only one (scaled) entry of its estimate of ω, thus significantly reducing communication costs at each iteration. The algorithms in [3] more or less rely on doubly stochastic matrices, which implicitly requires bidirectional communication between each pair of neighboring agents. This requirement restricts the applications of the algorithm in scenarios with possibly uni-directional communication. To get around this limitation, we propose a variant using the idea of push-sum [19] with which each agent only needs to transmit two scalar-valued variables at each time, and in particular, each agent can independently determine which entry of the estimate of ω to transmit. For simplicity, we take a randomized way and show that with this very cheap communication scheme, the algorithm can solve the distributed reinforcement learning problem for any strongly connected graph that depicts the communication relationships among the agents. Note that in parallel with this work, [20] appeared to be the first one that considers the communication efficiency in MARL and distributed RL in general. However, the setting in [20] assumes the existence of a central controller. In contrast, to the best of our knowledge, our work proposes the first communication-efficient MARL/distributed RL algorithm for networked agents.

The remaining part of this paper is organized as follows. Section II introduces the MARL problem and presents a communication-efficient variant of the algorithm in [3], which essentially requires bi-directional communication and transmission coordination between neighboring agents. Motivated by the restrictive requirements, Section III proposes a multi-agent actor-critic algorithm for uni-directional communication, based on which, a modification is proposed in Section IV, yielding a communication-efficient algorithm over directed graphs without any transmission coordination among the agents. The paper ends with some concluding remarks in Section V, followed by an appendix.

II. Problem Formulation

In this section, we introduce the background and formulation of the MARL problem with networked agents.

A. Networked Multi-Agent MDP

Consider a team of N agents, denoted by 𝒩={1,2,,N}, operating in a common environment. It is assumed that no central controller that can either collect rewards or make the decisions for the agents exists. In contrast, the agents are connected by a possibly time-varying communication network depicted by an undirected graph 𝒢t=(𝒩,t), where the set of communication links at time t is denoted by t. Then, a networked multi-agent MDP model can be defined by a tuple (𝒮,{𝒜i}i𝒩,P,{Ri}i𝒩,{𝒢t}t0), where 𝒮 is the state space shared by all the agents in 𝒩, 𝒜i is the action space of agent i, and {𝒢t}t0 is a sequence of time-varying communication networks. For each agent i, Ri:𝒮×𝒜 is the local reward function, where 𝒜=i=1N𝒜i is the joint action space. P:𝒮×𝒜×𝒮[0,1] denotes the state transition probability of the MDP. It is assumed throughout the paper that the states are globally observable and but the rewards are observed only locally.

The networked multi-agent MDP evolves as follows. Each agent i chooses its own action ati given state st at time t, according to a local policy, i.e., the probability of choosing action ai at state s, πi:𝒮×𝒜i[0,1]. Note that the joint policy of all agents, π:𝒮×𝒜[0,1], satisfies π(s,a)=i𝒩πi(s,ai). Also, a reward rt+1i is received by agent i after executing the action. To make the search of the optimal joint policy tractable, we assume that the local policy is parameterized by πθii, where θiΘi is the parameter, and Θimi is a compact set. The parameters are concatenated as θ=[(θ1),,(θN)]Θ, where Θ=i=1NΘi. The joint policy is thus given by πθ(s,a)=i𝒩πθii(s,ai). We first make a standard regularity assumption on the model and the policy parameterization.

Assumption 1.

For any iN, sS, and aiAi, the policy function πθii(s,ai)>0 for any θiΘi. Also, πθii(s,ai) is continuously differentiable with respect to the parameter θi over Θi. In addition, for any θΘ, let Pθ be the transition matrix of the Markov chain {st}t0 induced by policy πθ, that is, for any s,sS

Pθ(s|s)=a𝒜πθ(s,a)P(s|s,a). (1)

We assume that the Markov chain {st}t0 is irreducible and aperiodic under any πθ, with the stationary distribution denoted by dθ.

Assumption 1 has been made in the existing work on actor-critic (AC) algorithms with function approximation [21, 22]. It implies that the Markov chain of the state-action pair {(st,at)}t0 has a stationary distribution dθ(s)πθ(s,a) for any s𝒮 and a𝒜.

The objective of the agents is to collaboratively find a policy πθ that maximizes the globally averaged long-term return over the network based solely on local information, namely,

maxθJ(θ)=  limT1T𝔼(t=0T-11Ni𝒩rt+1i)
=  s𝒮,a𝒜dθ(s)πθ(s,a)R¯(s,a), (2)

where R¯(s,a)=N-1i𝒩Ri(s,a) is the globally averaged reward function. Let r¯t=N-1i𝒩rti; then, we have R¯(s,a)=𝔼[r¯t+1|st=s,at=a]. Thus, the global relative action-value function under policy πθ can be defined accordingly as

Qθ(s,a)=t𝔼[r¯t+1-J(θ)|s0=s,a0=a,πθ],

and the global relative state-value function Vθ(s) is defined as Vθ(s)=a𝒜πθ(s,a)Qθ(s,a). For simplicity, hereafter we will refer to Vθ and Qθ as state-value function and action-value function only. Furthermore, the advantage function can be defined as Aθ(s,a)=Qθ(s,a)-Vθ(s).

As the basis for developing multi-agent actor-critic algorithms, the following policy gradient theorem was established in [3] for MARL.

Policy Gradient Theorem for MARL: [Theorem 3.1 in [3]] For any θΘ and any agent i𝒩, we define the local advantage function Aθi:𝒮×𝒜 as

Aθi(s,a)=  Qθ(s,a)-V~θi(s,a-i), (3)

where V~θi(s,a-i)=ai𝒜iπθii(s,ai)Qθ(s,ai,a-i). We use a-i to denote the actions of all agents except for i. Then, the gradient of J(θ) with respect to θi is given by

θiJ(θ) =𝔼sdθ,aπθ[θilogπθii(s,ai)Aθ(s,a)]
=𝔼sdθ,aπθ[θilogπθii(s,ai)Aθi(s,a)]. (4)

B. A Motivating Algorithm

As mentioned earlier, the distributed actor-critic algorithms proposed in [3] require each agent to transmit its estimate of ω to all its neighbors at each time, which can be communication-expensive when the size of ω is very large. A natural idea to reduce the communication cost at each time step is to allow each agent to transmit only partial entries of its estimate vector to its neighbors at one time. Such an idea has been explored in distributed algorithms for solving linear algebraic equations [23] and finding a common fixed point among a family of nonlinear maps [24].

We first propose a communication-efficient algorithm based on the algorithm in [3] and then show its limitation in implementation, which serves as a motivating algorithm for our main contribution in the next sections.

The algorithm is based on the local advantage function Aθi defined in (3), which requires estimating the action-value function Qθ of policy πθ. Consider Q(,;ω):𝒮×𝒜, a family of functions parametrized by ωK, where K|𝒮||𝒜|. It is assumed that each agent i maintains its own parameter ωi and uses Q(,;ωi) as a local estimate of Qθ.

The algorithm consists of two steps, the actor step and the critic step. In the critic step, an update based on temporal difference (TD) learning is performed at each agent to estimate Q(,;ωi), followed by a linear combination of its neighbors’ parameter estimates. Different from Algorithm 1 in [3] in which the parameter sharing step is the consensus update for all N agents’ entire vectors of their estimates of ω, the algorithm here allows each agent to randomly transmit some entries of its estimate vector. For simplicity, suppose that each agent randomly picks one entry of its estimate vector at each time and then transmits the entry to its neighbors. Then, the critic step involves a consensus update for each entry k of the estimate vector, depending on which agents transmit the corresponding entry at time t, which involves a weight matrix Ctk=[ctk(i,j)]N×N, where ctk(i,j) is the weight on the k-th entry transmitted from agent j to agent i at time t. Specifically, the critic step iterates as follows:

μt+1i =(1-βω,t)μti+βω,trt+1i,
ω~ti =ωti+βω,tδtiωQt(ωti), (5)
ωt+1ik =j𝒩ctk(i,j)ω~tjk,k{1,2,,K},

where ωtik and ω~tik denote the k-th entry of ωti and ω~ti, respectively, μti tracks the long-term return of agent i, βω,t>0 is the stepsize, Qt(ω)=Q(st,at;ω) for any ω, and the local action-value TD-error δti in (5) is defined as

δti=rt+1i-μti+Qt+1(ωti)-Qt(ωti). (6)

The actor step is motivated by (A.) and is the same as that of Algorithm 1 in [3].

In the special case when all weight matrices Ctk=[ctk(i,j)]N×N are the same, i.e., Ctk=Ct=[ct(i,j)]N×N, k{1,,K}, the algorithm simplifies to Algorithm 1 in [3], i.e., all agents’ estimate vectors are transmitted simultaneously at each time. For this case, it has been shown in [3] that the algorithm will guarantee the convergence upon the following assumption on the weight matrix Ct.

Assumption 2.

The sequence of random matrices {Ct}t0 satisfies the following conditions:

  1. 1.

    Ct is row stochastic and 𝔼(Ct) is column stochastic.11 1 A nonnegative matrix is called row (or column) stochastic if all its row (or column) sums equal to one. There exists a constant η>0 such that ct(i,j)η for any ct(i,j)>0.

  2. 2.

    Ct respects the communication graph 𝒢t, i.e., ct(i,j)=0, if (j,i)t.

  3. 3.

    The spectral norm of 𝔼[Ct(I-𝟏𝟏/N)Ct] is strictly less than one.

  4. 4.

    Given the σ-algebra generated by the random variable before time t, Ct is conditionally independent of rt+1i for any i𝒩.

Here 𝟏N denotes the N-dimensional column vector whose entries are all equal to one.

We now consider the heterogeneous Ctk case proposed in our algorithm. Let ω=[(ω1),,(ωN)] and ω~=[(ω~1),,(ω~N)]. Then, from (5), it can be verified that

ωt+1=C¯tω~t,

where C¯t=k=1KCtk(ekek)NK×NK and ek is k-th unit vector. More can be said.

Proposition 1.

If Ctk satisfies Assumption 2 for all k{1,,K}, then C¯t does as well.

The proof of this proposition can be found in the appendix.

The above proposition implies that the proposed communication-efficient algorithm can also guarantee the convergence if each entry’s weight matrix Ctk satisfies Assumption 2.22 2 The proof is essentially the same as that in [3]. However, satisfying the assumption leads to restriction in implementation. Specifically, there only exist three distributed approaches to implement a weight matrix satisfying the assumption – the Metropolis algorithm [25], pairwise gossiping [26], and broadcast gossip [27]. However, the broadcast gossip cannot always guarantee average consensus and thus cannot be applied to the distributed RL problem considered here, and both the Metropolis algorithm and pairwise gossiping require bi-directional communication between each pair of neighboring agents. Thus, the implementation of the above communication-efficient algorithm requires bi-directional communication, i.e., the communication graph 𝒢t must be undirected. Moreover, when an agent i receives the k-th entry from its neighbor j’s estimate, agent i must send its k-th entry to agent j as well. Since each agent randomly picks one of its entries, in the worst scenario when each agent picks a distinct entry, the above requirement may lead all the agents to transmit N entries at one time. To get around this limitation, we will propose a multi-agent actor-critic algorithm for directed graphs in the next section, and then modify it to be communication-efficient in Section IV.

III. Multi-Agent Actor-Critic over Directed Graphs

In this section, we propose a multi-agent actor-critic algorithm for directed communication graphs by exploiting the idea of the push-sum protocol [28]. For simplicity, we focus on fixed graphs which are assumed to be strongly connected. The algorithms and their convergence results can be extended to time-varying directed graphs with a certain mild condition on joint strong connectivity.

Let each agent i have control over a scalar-valued variable yti whose initial value y0i=1. The critic step of the algorithm iterates as follows:

μt+1i =(1-βω,t)μti+βω,trt+1i,
ω~ti =ωti+βω,tδ~tizQt(zti),
ωt+1i =j𝒩b(i,j)ω~tj,
yt+1i =j𝒩b(i,j)ytj, (7)
zt+1i =ωt+1iyt+1i,

where μti tracks the long-term average return of agent i, βω,t>0 is the stepsize, Qt(z) denotes Q(st,at;z) for any z, and b(i,j) will be specified shortly. The local action-value TD-error δ~ti in (7) is given by

δ~ti=rt+1i-μti+Qt+1(zti)-Qt(zti). (8)

As for the actor step, each agent i improves its policy via

θt+1i=θti+βθ,tAtiψti, (9)

where βθ,t>0 is the stepsize, and Ati and ψti are defined as

Ati =Qt(zti)-ai𝒜iπθtii(st,ai)Q(st,ai,at-i;zti),
ψti =θilogπθtii(st,ati). (10)

Let 𝒢=(𝒩,) be the underlying communication graph. Then, b(i,j) is given as follows: For all i,j𝒩,

b(i,j)=  (1+dj)-1,if (j,i),
b(i,j)=  0,if (j,i),

where dj is the number of out-going neighbors33 3 Suppose that (i,j) is a directed edge in graph 𝒢. We say that agent i is an in-neighbor of agent j, and that agent j is an out-neighbor of agent i. of agent j, or equivalently, the out-degree of vertex j in 𝒢. Let B be the matrix whose ij-th entry is b(i,j). Then, it is easy to see that

  1. 1.

    B is column stochastic, and there exists a constant η>0 such that b(i,j)η for any b(i,j)>0;

  2. 2.

    B respects the communication graph 𝒢;

  3. 3.

    Given the σ-algebra generated by the random variable before time t, B is conditionally independent of rt+1i for any i𝒩.

We impose the following assumptions for the actor-critic algorithm which are either mild or standard; see [3] for detailed discussions on these assumptions.

Assumption 3.

The instantaneous reward rti is uniformly bounded for any iN and t0.

Assumption 4.

The stepsizes βω,t and βθ,t satisfy

tβω,t=tβθ,t=,
tβω,t2+βθ,t2<.

In addition, βθ,t=o(βω,t), and limtβω,t+1βω,t-1=1.

Assumption 5.

For each agent i, the function Q(s,a;z) is parametrized as Q(s,a;z)=zϕ(s,a), where ϕ(s,a)=[ϕ1(s,a),,ϕK(s,a)]RK is the feature associated with (s,a). The feature vector ϕ(s,a) is uniformly bounded for any sS,aA. Furthermore, the feature matrix ΦR|S||A|×K has full column rank, where the k-th column of Φ is [ϕk(s,a),sS,aA] for any k[K]. Also, for any uRK, Φu1K.

Assumption 6.

The update of the policy parameter θti includes a local projection operator, Γi:RmiΘiRmi, that projects any θti onto the compact set Θi. Also, we assume that Θ=i=1NΘi is large enough to include at least one local minimum of J(θ).

For simplicity, we define Pθ(s,a|s,a)=P(s|s,a)πθ(s,a)44 4 With slight abuse of notation, the expression Pθ has the same form as the transition probability matrix of the Markov chain {st}t0 under policy πθ (see the definition in (1)). These two matrices can be easily differentiated by the context., Dθs,a=diag[dθ(s)πθ(s,a),s𝒮,a𝒜], and R¯=[R¯(s,a),s𝒮,a𝒜]|𝒮||𝒜|. We then define the operator TθQ:|𝒮||𝒜||𝒮||𝒜| for any action-value vector Q|𝒮||𝒜| as

TθQ(Q)=R¯-J(θ)𝟏|𝒮||𝒜|+PθQ. (11)

We also define the vector Γ^i() as

Γ^i[g(θ)]=lim0<η0{Γi[θi+ηg(θ)]-θi}/η (12)

for any θΘ and g:Θi𝒩mi a continuous function. In case the limit above is not unique, Γ^i[g(θ)] is defined as the set of all possible limit points of (12).

With the above notation, we have the convergence of the critic step (7) – (8) and actor step (9) – (10) given policy πθ as follows.

Theorem 1.

Suppose that Assumptions 1 and 3-5 hold, and that communication graph G is strongly connected. Then, for any given policy πθ, with the sequences {μti} and {zti} generated from (7) and (8), we have limtiNμtiN-1=J(θ) and limtzti=ωθ almost surely for any iN, where J(θ) is the globally averaged return as defined in (A.), and ωθ is the unique solution to

ΦDθs,a[TθQ(Φωθ)-Φωθ]=0.

Suppose further that Assumption 6 holds. Then, the sequence {θti} obtained from (9) converges almost surely to a point in the set of the asymptotically stable equilibria of

θ˙i=Γ^i[𝔼stdθ,atπθ(At,θiψt,θi)],i𝒩.

The algorithm in this section works for directed communication graphs, but still requires each agent to transmit its entire estimate vector. In the next section, we will modify the algorithm to significantly reduce the communication cost at each time. The theorem just stated is a special case of the theorem in the next section.

IV. A Communication-Efficient Algorithm over Directed Graphs

In this section, we present a communication-efficient distributed actor-critic algorithm, modified from the algorithm in the last section, in which each agent can independently transmit one scaled entry of its state vector, and in total only transmit two scalars at each iteration.

In our communication-efficient algorithm, each agent i randomly picks one of its entries, k, of its estimate vector at time t with probability ptik, and then transmit its scaled value to its out-neighbors. For simplicity, we assume that ptik=pik for all t, and that pik>0 and k=1Kpik=1. For each entry k, each agent i has control over a scalar-valued variable ytik whose initial value is y0ik=1.

The critic step iterates as follows:

μt+1i =(1-βω,t)μti+βω,trt+1i,
ω~ti =ωti+βω,tδ~tizQt(zti),
ωt+1ik =j𝒩btk(i,j)ω~tjk,k{1,2,,K},
yt+1ik =j𝒩btk(i,j)ytjk,k{1,2,,K}, (13)
zt+1ik =ωt+1ikyt+1ik,

where ωtik and ω~tik denote the k-th entry of ωti and ω~ti, respectively, μti tracks the long-term return of agent i, βω,t>0 is the stepsize, Qt(z)=Q(st,at;z) for any z, and btk(i,j) will be specified shortly. The local action-value TD-error δ~ti in (13) is given by

δ~ti=rt+1i-μti+Qt+1(zti)-Qt(zti). (14)

As for the actor step, each agent i improves its policy via

θt+1i=θti+βθ,tAtiψti, (15)

where βθ,t>0 is the stepsize. Moreover, Ati and ψti are defined as

Ati =Qt(zti)-ai𝒜iπθtii(st,ai)Q(st,ai,at-i;zti),
ψti =θilogπθtii(st,ati).

Let 𝒢=(𝒩,) be the underlying communication graph. Then, btk(i,j) is given as follows. For all i,j𝒩, btk(i,j)=(1+dj)-1 if (j,i) and agent i picks k-th entry at time t; otherwise, btk(i,j)=0, where dj is the number of out-going neighbors of agent j, or equivalently, the out-degree of vertex j in 𝒢. Let Btk be the matrix whose ij-th entry is btk(i,j). Then, it is easy to see that

  1. 1.

    Each Btk is column stochastic, and there exists a constant η>0 such that btk(i,j)η for any btk(i,j)>0;

  2. 2.

    Each Btk respects the communication relationship, i.e., btk(i,j)=0 if agent i does not receive information from agent j at time t;

  3. 3.

    Given the σ-algebra generated by the random variable before time t, each Btk is conditionally independent of rt+1i for any i𝒩.

It is worth emphasizing that in our communication-efficient algorithm, each agent j only needs to transmit two scalars, ω~tjk1+dj and ytjk1+dj, at each iteration, and all the computations are in a fully distributed manner.

Theorem 2.

Suppose that Assumptions 1 and 3-5 hold, and that communication graph G is strongly connected. Then, for any given policy πθ, with the sequences {zti} and {μti} generated from (13) and (14), we have limtiNμtiN-1=J(θ) and limtzti=ωθ almost surely for any iN, where J(θ) is the globally averaged return as defined in (A.), and ωθ is the unique solution to

ΦDθs,a[TθQ(Φωθ)-Φωθ]=0.

Suppose further that Assumption 6 holds. Then, the sequence {θti} obtained from (15) converges almost surely to a point in the set of the asymptotically stable equilibria of

θ˙i=Γ^i[𝔼stdθ,atπθ(At,θiψt,θi)],i𝒩.

To prove the above theorem, we need the following concepts and results.

Define the operator :KNK as

x=1N(𝟏NIK)x=1Ni𝒩xi

for any x=[(x1),,(xN)]KN with xiK for all i𝒩.

Lemma 1.

For all iN, limtzti=limtωt=limtzt.

Proof: From the update of each agent i in (13),

zt+1ik =ωt+1ikyt+1ik
=j𝒩btk(i,j)ωtjk+btk(i,j)βω,tu~t+1jkj𝒩btk(i,j)ytjk
=j𝒩btk(i,j)ωtjk+btk(i,j)βω,tu~t+1jkl𝒩btk(i,l)ytlk
=j𝒩(btk(i,j)ωtjk+btk(i,j)βω,tu~t+1jk)/(btk(i,j)ytjk)1+l𝒩,ljbtk(i,l)ytlk/(bt(i,j)ytjk)
=j𝒩ztjk+βω,tu~t+1jk/ytjk1+l𝒩,ljbtk(i,l)ytlk/(btk(i,j)ytjk)
=j𝒩stk(i,j)(ztjk+βω,tϵtjk),

where

stk(i,j)=(1+l𝒩,ljbtk(i,l)ytlk/(btk(i,j)ytjk))-1,

u~t+1i=δ~tizQ(zti), and ϵtik=u~t+1ik/ytik. Let St=k=1KStk(ekek), zt=[(zt1),,(ztN)], and ϵt=[(ϵt1),,(ϵtN)], where Stk=[stk(i,j)]N×N, zti=[zti1,,ztik], and ϵti=[ϵti1,,ϵtiK]. Then, we have St𝟏=𝟏, and

zt+1=St(zt+βω,tϵt). (16)

From Lemma 1 (b) in [29], we know that limtzti=limtω~t. Since limtβω,tδtizQ(zti)=limtβω,tδtiϕ(st,at)=0, we have limtω~ti=limtωti. Thus, limtzti=limtω~t=limtωt, and limt(zt-ωt)=0.  

Let yt=[(yt1),,(ytN)], where yti=[yti1,,ytiK]. Then, we have yt+1=Btyt.

Lemma 2.

Suppose that communication graph G is strongly connected. There exists a constant α>0 such that αytikN for any i,k,t almost surely.

Proof: The existence of a uniform lower bound of ytik is a consequence of Lemma 3 in [30]. Since y0ik=1 for all i,k and each Btk is column stochastic, i=1Nytik=i=1Ny0ik=N for all t. It follows that αytikN for any i,k,t.  

Lemma 3.

Under Assumptions 1 and 3, the sequence {μti} generated as in (13) is bounded almost surely.

Proof: The proof of the lemma is the same as that of Lemma 5.2 in [3].  

Lemma 4.

Under Assumptions 1 and 3-5, the sequence {zti} is bounded almost surely, i.e., suptzti<.

Proof: Recall that the update of z is zt+1=St(zt+βω,tϵt) given in (16). Let hi(zti,μti,yti,st,at)=𝔼(ϵti|t,1), Mt+1i=ϵti-𝔼(ϵti|t,1). Since the Markov chain {(st,at)}t0 is irreducible and aperidic given policy πθ, we have h¯i(ωti,μti,yti)=𝔼stdθ,atπθ[hi(zti,μti,yti,st,at)]=ΦDθs,a[R~i-𝟏Nμ~ti+(PθΦ-Φ)z~ti], where R~i=[Ri1yti1,,RiKytiK,,Ri((N-1)K+1)yti1,,Ri(NK)ytiK], μ~ti=[μtiyti1,,μtiytiK] and z~ti=[zti1yti1,,ztiKytiK].

From Assumptions 3 and 5, and Lemmas 2 and 3, we know that K1,K2>0, s.t. ϕtkytikK1 and rt+1i-μtiK2,k,i. Thus, K3>0 such that h¯i(ωti,μti,yti)-hi(zti,μti,yti,st,at)2K3(1+ωt2). Moreover, we know hi(zti,μti,yti,st,at) is Lipschitz continuous in zti, and Mt+1i is martingale difference sequence. Since each Btk is column stochastic, it has bounded norm. Thus, by Theorem A.2 in [3], it follows that zti is bounded almost surely.  

Lemma 5.

Under Assumptions 1 and 3-5, the sequence {ωti} is bounded almost surely, i.e., suptωti<.

Proof: From (13), we know that for each entry k in ωt, ωtk=ztkytk, k{1,,NK}. Moreover, from Lemmas 2 and 4, zt and yt are bounded almost surely. Therefore, it is easy to show that ωt is also bounded almost surely.  

We are now in a position to prove Theorem 2.

Proof of Theorem 2: Let {t,1} be the filtration with t,1=σ(rτ,μτ,ωτ,zτ,yτ,sτ,aτ,Bτ-1,τ<t). Let vti=[μti,(ωti)] and vt=[(vt1),,(vtN)]. Then, the iteration of ωt has the following form:

ωt+1 =1N(𝟏NIK)Bt(ωt+βω,tu~t+1)
=1N(𝟏NIK)(ωt+βω,tu~t+1)
=ωt+βω,tu~t+1
=ωt+βω,tu~t+1
=ωt+βω,tδ~tϕt.

Hence, the updates for ωt and μt are

μt+1 =μt+βω,t𝔼(r¯t+1-μt|t,1)+βω,tξt+1,1, (17)
ωt+1 =ωt+βω,t𝔼(δtϕt|t,1)+βω,tξt+1,2+βω,tγt+1, (18)

where ξt+1,1=r¯t+1-𝔼(r¯t+1-μt|t,1), ξt+1,2=δtϕt-𝔼(δtϕt|t,1), and γt+1=δ~tϕt-δtϕt.

Note that 𝔼(r¯t+1-μt|t,1) is Lipschitz continuous in μt, and that 𝔼(δtϕt|t,1) is Lipschitz continuous in both ωt and μt. Moreover, ξt+1,1 and ξt+1,2 are martingale differences sequences. From Lemmas 2 and 5, {γt} is a bounded random sequence with γt0 as t almost surely.

From Theorem B.2 in [3], the following ODE captures the asymptotic behavior of (17) and (18):

v˙ =[μ˙ω˙]
=[-10-ΦDθs,a𝟏NKΦDθs,a(Pθ-INK)Φ][μω]+[J(θ)ΦDθs,aR¯] (19)

where Dθs,a=diag[dθ(s)πθ(s,a),s𝒮,a𝒜]. From the proof of Theorem 4.6 in [3], the ODE (19) is globally asymptotically stable and has its equilibrium satisfying

{μ=J(θ)ΦDθs,a[R¯-μ𝟏NK+PθΦω-Φω]=0. (20)

Note that the solution for μ at equilibrium is J(θ), and that the solution for ω has the form ωθ+lv with any l and vK such that ϕv=𝟏K, where ωθ follows that ΦDθs,a[TθQ(Φωθ)-Φωθ]=0. Moreover, ϕv𝟏K by Assumption 5, so ωθ is the unique solution, which implies that limtωt=ωθ. Combining the above facts with Lemma 1, we conclude that limtzti=ωθ.

As for the actor step convergence, the proof is the same as that of Theorem 4.7 in [3].  

V. Conclusions

This paper has proposed a communication-efficient distributed reinforcement learning algorithm. We have shown that the algorithm allows each agent to only transmit two scalar-valued variables, one of which is an independently selected entry of the agent’s estimate vector, at each time, and works for any strongly connected graph, which significantly reduces communication cost at one time compared with the algorithms in [3]. It is fairly straightforward to extend the algorithm and its convergence result to the case where each agent transmits more than one entry of its estimate vector. It is also fairly straightforward to extend the proposed algorithm to an asynchronous case without communication delays as was done in [31]. In the case when communication delays are taken into account, a modified version of the algorithm here is expected to solve the problem using the idea in [32]. Future directions of this work include comparison of total communication cost with the algorithms in [3], extension to time-varying communication graphs, and development of asynchronous versions of the algorithm.

VI. Appendix

Proof of Proposition 1: (1) Since all Ctk(t) satisfy condition 1) in Assumption 2, it follows that

C¯t𝟏NK =k=1K(Ctk(ekek))(𝟏N𝟏K)
=k=1K(Ctk𝟏N)(ekek𝟏K)=𝟏NK,
𝟏NK𝔼[C¯t] =k=1K(𝟏N𝟏K)(𝔼[Ctk](ekek))
=k=1K(𝟏N𝔼[Ctk])(𝟏K(ekek))=𝟏NK.

From the definition of C¯t, for any entry (i,j) of Ctk, c¯t((i-1)K+k,(j-1)K+k)=ctk(i,j). Since for any ctk(i,j)>0, ctk(i,j)η,k{1,,K},i,j{1,,N}, each entry of C¯t also satisfies condition 1).

(2) From the definition of C¯t, since all Ctk satisfy condition 2) in Assumption 2, so does C¯t.

(3) First note that

𝔼[C¯t(IN-𝟏N𝟏N/N)IKC¯t]
=𝔼[(k=1KCtk(ekek))(IN-𝟏N𝟏N/N)(k=1Kekek)(k=1KCtk(ekek))]
=𝔼[(k=1K(Ctk)(ekek))(k=1K(IN-𝟏N𝟏N/N)(ekek))(k=1KCtk(ekek))]
=𝔼[(k=1K((Ctk)(IN-𝟏N𝟏N/N)Ctk)(ekek))].

There exists a permutation matrix D such that

D𝔼[(k=1K((Ctk)(IN-𝟏N𝟏N/N)Ctk)(ekek))]D
=𝔼[diag{((Ct1)(IN-𝟏N𝟏N/N)Ct1),,((CtK)(IN-𝟏N𝟏N/N)CtK)}].

Thus, the spectral norm of 𝔼[(k=1K((Ctk)(IN-𝟏N𝟏N/N)Ctk)(ekek))] is the same as that of 𝔼[diag{((Ct1)(IN-𝟏N𝟏N/N)Ct1),,((CtK)(IN-𝟏N𝟏N/N)CtK)}]. Since the spectral norm of
𝔼[(Ctk)(IN-𝟏N𝟏N/N)Ctk)] is strictly less than one for all k, so is the spectral norm of

𝔼[diag{((Ct1)(IN-𝟏N𝟏N/N)Ct1),,((CtK)(IN-𝟏N𝟏N/N)CtK)}].

Thus, the spectral norm of 𝔼[C¯t(IN-𝟏N𝟏N/N)IKC¯t] is strictly less than one.

(4) From the definition of C¯t, it is easy to see that C¯t is conditionally independent of rt+1i for any i as all Ctk(t) are. This completes the proof.  

References

  • [1] G. Mateos, J.A. Bazerque, and G.B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 52(10):5262–5276, 2010.
  • [2] R.K. Kolla, K. Jagannathan, and A. Gopalan. Collaborative learning of stochastic bandits over a social network. IEEE Transactions on Networking, 26(4):1782–1795, 2018.
  • [3] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pages 5872–5881, 2018.
  • [4] Z. Jiang, A. Balu, C. Hegde, and S. Sarkar. Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems, pages 5904–5914, 2017.
  • [5] C. Boutilier. Planning, learning and coordination in multi-agent decision processes. In Conference on Theoretical Aspects of Rationality and Knowledge, pages 195–210, 1996.
  • [6] M. Lauer and M. Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In International Conference on Machine Learning, pages 535–542, 2000.
  • [7] M.L. Littman. Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55–66, 2001.
  • [8] X. Wang and T. Sandholm. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems, pages 1603–1610, 2003.
  • [9] S. Kar, J.M. Moura, and H.V. Poor. QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Transactions on Signal Processing, 61(7):1848–1862, 2013.
  • [10] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Finite-sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783, 2018.
  • [11] D. Lee, H. Yoon, V. Cichella, and N. Hovakimyan. Stochastic primal-dual algorithm for distributed gradient temporal difference learning. arXiv preprint arXiv:1805.07918, 2018.
  • [12] T.T. Doan, S.T. Maguluri, and J. Romberg. Convergence rates of distributed TD (0) with linear function approximation for multi-agent reinforcement learning. arXiv preprint arXiv:1902.07393, 2019.
  • [13] J. Hu and M.P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(11):1039–1069, 2003.
  • [14] J. Foerster, Y.M. Assael, N. Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
  • [15] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.
  • [16] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning, pages 2681–2690, 2017.
  • [17] K. Zhang, Z. Yang, and T. Başar. Networked multi-agent reinforcement learning in continuous spaces. In Proceedings of IEEE Conference on Decision and Control, pages 2771–2776. IEEE, 2018.
  • [18] Y. Shoham, R. Powers, and T. Grenager. Multi-agent reinforcement learning: A critical survey. Technical Report, 2003.
  • [19] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In Proceedings of the 44th IEEE Symposium on Foundations of Computer Science, pages 482–491, 2003.
  • [20] T. Chen, K. Zhang, G. B. Giannakis, and T. Başar. Communication-efficient distributed reinforcement learning. arXiv preprint arXiv:1812.03239, 2018.
  • [21] V.R. Konda and J.N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
  • [22] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
  • [23] X. Gao, J. Liu, and T. Başar. Stochastic communication-efficient distributed algorithms for solving linear algebraic equations. In Proceedings of the IEEE Multi-Conference on Systems and Control, pages 380–385, 2016.
  • [24] D. Fullmer, L. Wang, and A.S. Morse. On the distributed computation of a common fixed point of a family of paracontractions. In Proceedings of IEEE Conference on Decision and Control, pages 2289–2293, 2017.
  • [25] L. Xiao, S. Boyd, and S. Lall. A scheme for robust distributed sensor fusion based on average consensus. In International Symposium on Information Processing in Sensor Networks, pages 63–70, 2005.
  • [26] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking, 14(SI):2508–2530, 2006.
  • [27] T.C. Aysal, M.E. Yildiz, A.D. Sarwate, and A. Scaglione. Broadcast gossip algorithms for consensus. IEEE Transactions on Signal Processing, 57(7):2748–2761, 2009.
  • [28] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In Proceedings of IEEE Symposium on Foundations of Computer Science, pages 482–491, 2003.
  • [29] A. Nedić and A. Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
  • [30] P. Rezaienia, B. Gharesifard, T. Linder, and B. Touri. Convergence rate of push-sum algorithms on random graphs. In Proceedings of IEEE Conference on Decision and Control, pages 4218–4223, 2018.
  • [31] J. Liu and A.S. Morse. Asynchronous distributed averaging using double linear iterations. In Proceedings of the American Control Conference, pages 6620–6625, 2012.
  • [32] C.N. Hadjicostis and T. Charalambous. Average consensus in the presence of delays in directed graph topologies. IEEE Transactions on Automatic Control, 59(3):763–768, 2014.