This paper considers a distributed reinforcement learning problem in which anetwork of multiple agents aim to cooperatively maximize the globally averagedreturn through communication with only local neighbors. A randomizedcommunication-efficient multi-agent actor-critic algorithm is proposed forpossibly unidirectional communication relationships depicted by a directedgraph. It is shown that the algorithm can solve the problem for stronglyconnected graphs by allowing each agent to transmit only two scalar-valuedvariables at one time.
Quick Read (beta)
A Communication-Efficient Multi-Agent Actor-Critic Algorithm for Distributed Reinforcement Learning
This paper considers a distributed reinforcement learning problem in which a network of multiple agents aim to cooperatively maximize the globally averaged return through communication with only local neighbors. A randomized communication-efficient multi-agent actor-critic algorithm is proposed for possibly unidirectional communication relationships depicted by a directed graph. It is shown that the algorithm can solve the problem for strongly connected graphs by allowing each agent to transmit only two scalar-valued variables at one time.
Recently, there has been increasing interest in developing distributed machine learning algorithms. Notable examples include distributed linear regression , multi-arm bandit , reinforcement learning (RL) , and deep learning . Such algorithms have promising applications in large-scale networks, such as social platforms, online economic networks, cyber-physical systems, and Internet of Things, primarily because in such a complex network, it is impossible to collect all the information at the same point and each component of the network may not be willing to share its private information due to privacy issues.
Multi-agent reinforcement learning (MARL) problems have recently received increasing attention. In general, MARL problems are investigated in settings that are either collaborative, competitive, or a mixture of the two. For collaborative MARL, the most rudimentary framework is the canonical multi-agent Markov decision process [5, 6], where the agents share a common reward function that is determined by the joint actions of all agents. Another notable framework for collaborative MARL is the team Markov game model, also with a shared reward function among agents [7, 8]. These two frameworks were then extended to the setting where agents are allowed to have heterogeneous reward functions [9, 3, 10, 11, 12], collaborating with the goal of maximizing the long-term return corresponding to the team averaged reward. In particular, these works focused on a fully-decentralized/distributed setting, where there exists no central controller to coordinate the agents to achieve the overall team goal. Instead, the agents are connected via a communication network and are only able to exchange information with the neighbors on the network. There is also an ever-growing number of works on MARL in competitive and mixed settings [13, 14, 15, 16], where most of the recent ones are empirical works without theoretical convergence guarantees. In this work, we focus on the decentralized/distributed and collaborative MARL setting with networked agents as in [9, 3, 17].
Within this setting, the work of  proposed the first fully distributed actor-critic algorithm. The algorithm allows the agents to exchange information over a communication network with possibly sparse connectivity at each agent, which improves the scalability of the multi-agent model with a high population of agents and thus tackles one of the long-standing challenges in general MARL problems . A detailed comparison of the problem setting to the other existing ones on multi-agent and collaborative MARL is provided in .
A possible communication issue of the algorithms proposed in  is that they require each agent to transmit the entire vector of its estimate of to its neighbors at each time. Such communication-costly algorithms may not be possible to secure in some learning applications, for example, when the size of is very large but each agent has limited communication capacity. In this paper, we propose an approach in which each agent broadcasts only one (scaled) entry of its estimate of , thus significantly reducing communication costs at each iteration. The algorithms in  more or less rely on doubly stochastic matrices, which implicitly requires bidirectional communication between each pair of neighboring agents. This requirement restricts the applications of the algorithm in scenarios with possibly uni-directional communication. To get around this limitation, we propose a variant using the idea of push-sum  with which each agent only needs to transmit two scalar-valued variables at each time, and in particular, each agent can independently determine which entry of the estimate of to transmit. For simplicity, we take a randomized way and show that with this very cheap communication scheme, the algorithm can solve the distributed reinforcement learning problem for any strongly connected graph that depicts the communication relationships among the agents. Note that in parallel with this work,  appeared to be the first one that considers the communication efficiency in MARL and distributed RL in general. However, the setting in  assumes the existence of a central controller. In contrast, to the best of our knowledge, our work proposes the first communication-efficient MARL/distributed RL algorithm for networked agents.
The remaining part of this paper is organized as follows. Section II introduces the MARL problem and presents a communication-efficient variant of the algorithm in , which essentially requires bi-directional communication and transmission coordination between neighboring agents. Motivated by the restrictive requirements, Section III proposes a multi-agent actor-critic algorithm for uni-directional communication, based on which, a modification is proposed in Section IV, yielding a communication-efficient algorithm over directed graphs without any transmission coordination among the agents. The paper ends with some concluding remarks in Section V, followed by an appendix.
II. Problem Formulation
In this section, we introduce the background and formulation of the MARL problem with networked agents.
A. Networked Multi-Agent MDP
Consider a team of agents, denoted by , operating in a common environment. It is assumed that no central controller that can either collect rewards or make the decisions for the agents exists. In contrast, the agents are connected by a possibly time-varying communication network depicted by an undirected graph , where the set of communication links at time is denoted by . Then, a networked multi-agent MDP model can be defined by a tuple , where is the state space shared by all the agents in , is the action space of agent , and is a sequence of time-varying communication networks. For each agent , is the local reward function, where is the joint action space. denotes the state transition probability of the MDP. It is assumed throughout the paper that the states are globally observable and but the rewards are observed only locally.
The networked multi-agent MDP evolves as follows. Each agent chooses its own action given state at time , according to a local policy, i.e., the probability of choosing action at state , . Note that the joint policy of all agents, , satisfies . Also, a reward is received by agent after executing the action. To make the search of the optimal joint policy tractable, we assume that the local policy is parameterized by , where is the parameter, and is a compact set. The parameters are concatenated as , where . The joint policy is thus given by We first make a standard regularity assumption on the model and the policy parameterization.
For any , , and , the policy function for any . Also, is continuously differentiable with respect to the parameter over . In addition, for any , let be the transition matrix of the Markov chain induced by policy , that is, for any
We assume that the Markov chain is irreducible and aperiodic under any , with the stationary distribution denoted by .
Assumption 1 has been made in the existing work on actor-critic (AC) algorithms with function approximation [21, 22]. It implies that the Markov chain of the state-action pair has a stationary distribution for any and .
The objective of the agents is to collaboratively find a policy that maximizes the globally averaged long-term return over the network based solely on local information, namely,
where is the globally averaged reward function. Let ; then, we have . Thus, the global relative action-value function under policy can be defined accordingly as
and the global relative state-value function is defined as . For simplicity, hereafter we will refer to and as state-value function and action-value function only. Furthermore, the advantage function can be defined as .
As the basis for developing multi-agent actor-critic algorithms, the following policy gradient theorem was established in  for MARL.
Policy Gradient Theorem for MARL: [Theorem in ] For any and any agent , we define the local advantage function as
where . We use to denote the actions of all agents except for . Then, the gradient of with respect to is given by
B. A Motivating Algorithm
As mentioned earlier, the distributed actor-critic algorithms proposed in  require each agent to transmit its estimate of to all its neighbors at each time, which can be communication-expensive when the size of is very large. A natural idea to reduce the communication cost at each time step is to allow each agent to transmit only partial entries of its estimate vector to its neighbors at one time. Such an idea has been explored in distributed algorithms for solving linear algebraic equations  and finding a common fixed point among a family of nonlinear maps .
We first propose a communication-efficient algorithm based on the algorithm in  and then show its limitation in implementation, which serves as a motivating algorithm for our main contribution in the next sections.
The algorithm is based on the local advantage function defined in (3), which requires estimating the action-value function of policy . Consider , a family of functions parametrized by , where . It is assumed that each agent maintains its own parameter and uses as a local estimate of .
The algorithm consists of two steps, the actor step and the critic step. In the critic step, an update based on temporal difference (TD) learning is performed at each agent to estimate , followed by a linear combination of its neighbors’ parameter estimates. Different from Algorithm 1 in  in which the parameter sharing step is the consensus update for all agents’ entire vectors of their estimates of , the algorithm here allows each agent to randomly transmit some entries of its estimate vector. For simplicity, suppose that each agent randomly picks one entry of its estimate vector at each time and then transmits the entry to its neighbors. Then, the critic step involves a consensus update for each entry of the estimate vector, depending on which agents transmit the corresponding entry at time , which involves a weight matrix , where is the weight on the -th entry transmitted from agent to agent at time . Specifically, the critic step iterates as follows:
where and denote the -th entry of and , respectively, tracks the long-term return of agent , is the stepsize, for any , and the local action-value TD-error in (5) is defined as
In the special case when all weight matrices are the same, i.e., , , the algorithm simplifies to Algorithm 1 in , i.e., all agents’ estimate vectors are transmitted simultaneously at each time. For this case, it has been shown in  that the algorithm will guarantee the convergence upon the following assumption on the weight matrix .
The sequence of random matrices satisfies the following conditions:
is row stochastic and is column stochastic.11 1 A nonnegative matrix is called row (or column) stochastic if all its row (or column) sums equal to one. There exists a constant such that for any .
respects the communication graph , i.e., , if .
The spectral norm of is strictly less than one.
Given the -algebra generated by the random variable before time , is conditionally independent of for any .
Here denotes the -dimensional column vector whose entries are all equal to one.
We now consider the heterogeneous case proposed in our algorithm. Let and . Then, from (5), it can be verified that
where and is -th unit vector. More can be said.
If satisfies Assumption 2 for all , then does as well.
The proof of this proposition can be found in the appendix.
The above proposition implies that the proposed communication-efficient algorithm can also guarantee the convergence if each entry’s weight matrix satisfies Assumption 2.22 2 The proof is essentially the same as that in . However, satisfying the assumption leads to restriction in implementation. Specifically, there only exist three distributed approaches to implement a weight matrix satisfying the assumption – the Metropolis algorithm , pairwise gossiping , and broadcast gossip . However, the broadcast gossip cannot always guarantee average consensus and thus cannot be applied to the distributed RL problem considered here, and both the Metropolis algorithm and pairwise gossiping require bi-directional communication between each pair of neighboring agents. Thus, the implementation of the above communication-efficient algorithm requires bi-directional communication, i.e., the communication graph must be undirected. Moreover, when an agent receives the -th entry from its neighbor ’s estimate, agent must send its -th entry to agent as well. Since each agent randomly picks one of its entries, in the worst scenario when each agent picks a distinct entry, the above requirement may lead all the agents to transmit entries at one time. To get around this limitation, we will propose a multi-agent actor-critic algorithm for directed graphs in the next section, and then modify it to be communication-efficient in Section IV.
III. Multi-Agent Actor-Critic over Directed Graphs
In this section, we propose a multi-agent actor-critic algorithm for directed communication graphs by exploiting the idea of the push-sum protocol . For simplicity, we focus on fixed graphs which are assumed to be strongly connected. The algorithms and their convergence results can be extended to time-varying directed graphs with a certain mild condition on joint strong connectivity.
Let each agent have control over a scalar-valued variable whose initial value . The critic step of the algorithm iterates as follows:
where tracks the long-term average return of agent , is the stepsize, denotes for any , and will be specified shortly. The local action-value TD-error in (7) is given by
As for the actor step, each agent improves its policy via
where is the stepsize, and and are defined as
Let be the underlying communication graph. Then, is given as follows: For all ,
where is the number of out-going neighbors33 3 Suppose that is a directed edge in graph . We say that agent is an in-neighbor of agent , and that agent is an out-neighbor of agent . of agent , or equivalently, the out-degree of vertex in . Let be the matrix whose -th entry is . Then, it is easy to see that
is column stochastic, and there exists a constant such that for any ;
respects the communication graph ;
Given the -algebra generated by the random variable before time , is conditionally independent of for any .
We impose the following assumptions for the actor-critic algorithm which are either mild or standard; see  for detailed discussions on these assumptions.
The instantaneous reward is uniformly bounded for any and .
The stepsizes and satisfy
In addition, , and .
For each agent , the function is parametrized as , where is the feature associated with . The feature vector is uniformly bounded for any . Furthermore, the feature matrix has full column rank, where the -th column of is for any . Also, for any , .
The update of the policy parameter includes a local projection operator, , that projects any onto the compact set . Also, we assume that is large enough to include at least one local minimum of .
For simplicity, we define 44 4 With slight abuse of notation, the expression has the same form as the transition probability matrix of the Markov chain under policy (see the definition in (1)). These two matrices can be easily differentiated by the context., , and . We then define the operator for any action-value vector as
We also define the vector as
for any and a continuous function. In case the limit above is not unique, is defined as the set of all possible limit points of (12).
Suppose that Assumptions 1 and 3-5 hold, and that communication graph is strongly connected. Then, for any given policy , with the sequences and generated from (7) and (8), we have and almost surely for any , where is the globally averaged return as defined in (A.), and is the unique solution to
The algorithm in this section works for directed communication graphs, but still requires each agent to transmit its entire estimate vector. In the next section, we will modify the algorithm to significantly reduce the communication cost at each time. The theorem just stated is a special case of the theorem in the next section.
IV. A Communication-Efficient Algorithm over Directed Graphs
In this section, we present a communication-efficient distributed actor-critic algorithm, modified from the algorithm in the last section, in which each agent can independently transmit one scaled entry of its state vector, and in total only transmit two scalars at each iteration.
In our communication-efficient algorithm, each agent randomly picks one of its entries, , of its estimate vector at time with probability , and then transmit its scaled value to its out-neighbors. For simplicity, we assume that for all , and that and . For each entry , each agent has control over a scalar-valued variable whose initial value is .
The critic step iterates as follows:
where and denote the -th entry of and , respectively, tracks the long-term return of agent , is the stepsize, for any , and will be specified shortly. The local action-value TD-error in (13) is given by
As for the actor step, each agent improves its policy via
where is the stepsize. Moreover, and are defined as
Let be the underlying communication graph. Then, is given as follows. For all , if and agent picks -th entry at time ; otherwise, , where is the number of out-going neighbors of agent , or equivalently, the out-degree of vertex in . Let be the matrix whose -th entry is . Then, it is easy to see that
Each is column stochastic, and there exists a constant such that for any ;
Each respects the communication relationship, i.e., if agent does not receive information from agent at time ;
Given the -algebra generated by the random variable before time , each is conditionally independent of for any .
It is worth emphasizing that in our communication-efficient algorithm, each agent only needs to transmit two scalars, and , at each iteration, and all the computations are in a fully distributed manner.
Suppose that Assumptions 1 and 3-5 hold, and that communication graph is strongly connected. Then, for any given policy , with the sequences and generated from (13) and (14), we have and almost surely for any , where is the globally averaged return as defined in (A.), and is the unique solution to
To prove the above theorem, we need the following concepts and results.
Define the operator as
for any with for all .
For all , .
Proof: From the update of each agent in (13),
, and . Let , , and , where , , and . Then, we have , and
From Lemma 1 (b) in , we know that . Since , we have . Thus, , and .
Let , where . Then, we have .
Suppose that communication graph is strongly connected. There exists a constant such that for any almost surely.
Proof: The existence of a uniform lower bound of is a consequence of Lemma 3 in . Since for all and each is column stochastic, for all . It follows that for any .
Proof: The proof of the lemma is the same as that of Lemma 5.2 in .
Proof: Recall that the update of is given in (16). Let , . Since the Markov chain is irreducible and aperidic given policy , we have , where , and .
From Assumptions 3 and 5, and Lemmas 2 and 3, we know that , s.t. and . Thus, such that . Moreover, we know is Lipschitz continuous in , and is martingale difference sequence. Since each is column stochastic, it has bounded norm. Thus, by Theorem A.2 in , it follows that is bounded almost surely.
We are now in a position to prove Theorem 2.
Proof of Theorem 2: Let be the filtration with . Let and . Then, the iteration of has the following form:
Hence, the updates for and are
where , , and .
Note that is Lipschitz continuous in , and that is Lipschitz continuous in both and . Moreover, and are martingale differences sequences. From Lemmas 2 and 5, is a bounded random sequence with as almost surely.
Note that the solution for at equilibrium is , and that the solution for has the form with any and such that , where follows that . Moreover, by Assumption 5, so is the unique solution, which implies that . Combining the above facts with Lemma 1, we conclude that .
As for the actor step convergence, the proof is the same as that of Theorem 4.7 in .
This paper has proposed a communication-efficient distributed reinforcement learning algorithm. We have shown that the algorithm allows each agent to only transmit two scalar-valued variables, one of which is an independently selected entry of the agent’s estimate vector, at each time, and works for any strongly connected graph, which significantly reduces communication cost at one time compared with the algorithms in . It is fairly straightforward to extend the algorithm and its convergence result to the case where each agent transmits more than one entry of its estimate vector. It is also fairly straightforward to extend the proposed algorithm to an asynchronous case without communication delays as was done in . In the case when communication delays are taken into account, a modified version of the algorithm here is expected to solve the problem using the idea in . Future directions of this work include comparison of total communication cost with the algorithms in , extension to time-varying communication graphs, and development of asynchronous versions of the algorithm.
From the definition of , for any entry of , . Since for any , , each entry of also satisfies condition 1).
(2) From the definition of , since all satisfy condition 2) in Assumption 2, so does .
(3) First note that
There exists a permutation matrix such that
Thus, the spectral norm of is the same as that of .
Since the spectral norm of
is strictly less than one for all , so is the spectral norm of
Thus, the spectral norm of is strictly less than one.
(4) From the definition of , it is easy to see that is conditionally independent of for any as all are. This completes the proof.
-  G. Mateos, J.A. Bazerque, and G.B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 52(10):5262–5276, 2010.
-  R.K. Kolla, K. Jagannathan, and A. Gopalan. Collaborative learning of stochastic bandits over a social network. IEEE Transactions on Networking, 26(4):1782–1795, 2018.
-  K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pages 5872–5881, 2018.
-  Z. Jiang, A. Balu, C. Hegde, and S. Sarkar. Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems, pages 5904–5914, 2017.
-  C. Boutilier. Planning, learning and coordination in multi-agent decision processes. In Conference on Theoretical Aspects of Rationality and Knowledge, pages 195–210, 1996.
-  M. Lauer and M. Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In International Conference on Machine Learning, pages 535–542, 2000.
-  M.L. Littman. Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55–66, 2001.
-  X. Wang and T. Sandholm. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems, pages 1603–1610, 2003.
-  S. Kar, J.M. Moura, and H.V. Poor. QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Transactions on Signal Processing, 61(7):1848–1862, 2013.
-  K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Finite-sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783, 2018.
-  D. Lee, H. Yoon, V. Cichella, and N. Hovakimyan. Stochastic primal-dual algorithm for distributed gradient temporal difference learning. arXiv preprint arXiv:1805.07918, 2018.
-  T.T. Doan, S.T. Maguluri, and J. Romberg. Convergence rates of distributed TD (0) with linear function approximation for multi-agent reinforcement learning. arXiv preprint arXiv:1902.07393, 2019.
-  J. Hu and M.P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(11):1039–1069, 2003.
-  J. Foerster, Y.M. Assael, N. Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
-  R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.
-  S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning, pages 2681–2690, 2017.
-  K. Zhang, Z. Yang, and T. Başar. Networked multi-agent reinforcement learning in continuous spaces. In Proceedings of IEEE Conference on Decision and Control, pages 2771–2776. IEEE, 2018.
-  Y. Shoham, R. Powers, and T. Grenager. Multi-agent reinforcement learning: A critical survey. Technical Report, 2003.
-  D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In Proceedings of the 44th IEEE Symposium on Foundations of Computer Science, pages 482–491, 2003.
-  T. Chen, K. Zhang, G. B. Giannakis, and T. Başar. Communication-efficient distributed reinforcement learning. arXiv preprint arXiv:1812.03239, 2018.
-  V.R. Konda and J.N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
-  S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
-  X. Gao, J. Liu, and T. Başar. Stochastic communication-efficient distributed algorithms for solving linear algebraic equations. In Proceedings of the IEEE Multi-Conference on Systems and Control, pages 380–385, 2016.
-  D. Fullmer, L. Wang, and A.S. Morse. On the distributed computation of a common fixed point of a family of paracontractions. In Proceedings of IEEE Conference on Decision and Control, pages 2289–2293, 2017.
-  L. Xiao, S. Boyd, and S. Lall. A scheme for robust distributed sensor fusion based on average consensus. In International Symposium on Information Processing in Sensor Networks, pages 63–70, 2005.
-  S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking, 14(SI):2508–2530, 2006.
-  T.C. Aysal, M.E. Yildiz, A.D. Sarwate, and A. Scaglione. Broadcast gossip algorithms for consensus. IEEE Transactions on Signal Processing, 57(7):2748–2761, 2009.
-  D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In Proceedings of IEEE Symposium on Foundations of Computer Science, pages 482–491, 2003.
-  A. Nedić and A. Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
-  P. Rezaienia, B. Gharesifard, T. Linder, and B. Touri. Convergence rate of push-sum algorithms on random graphs. In Proceedings of IEEE Conference on Decision and Control, pages 4218–4223, 2018.
-  J. Liu and A.S. Morse. Asynchronous distributed averaging using double linear iterations. In Proceedings of the American Control Conference, pages 6620–6625, 2012.
-  C.N. Hadjicostis and T. Charalambous. Average consensus in the presence of delays in directed graph topologies. IEEE Transactions on Automatic Control, 59(3):763–768, 2014.