Abstract
This paper considers a distributed reinforcement learning problem in which anetwork of multiple agents aim to cooperatively maximize the globally averagedreturn through communication with only local neighbors. A randomizedcommunicationefficient multiagent actorcritic algorithm is proposed forpossibly unidirectional communication relationships depicted by a directedgraph. It is shown that the algorithm can solve the problem for stronglyconnected graphs by allowing each agent to transmit only two scalarvaluedvariables at one time.
Quick Read (beta)
A CommunicationEfficient MultiAgent ActorCritic Algorithm for Distributed Reinforcement Learning
Abstract
This paper considers a distributed reinforcement learning problem in which a network of multiple agents aim to cooperatively maximize the globally averaged return through communication with only local neighbors. A randomized communicationefficient multiagent actorcritic algorithm is proposed for possibly unidirectional communication relationships depicted by a directed graph. It is shown that the algorithm can solve the problem for strongly connected graphs by allowing each agent to transmit only two scalarvalued variables at one time.
I. Introduction
Recently, there has been increasing interest in developing distributed machine learning algorithms. Notable examples include distributed linear regression [1], multiarm bandit [2], reinforcement learning (RL) [3], and deep learning [4]. Such algorithms have promising applications in largescale networks, such as social platforms, online economic networks, cyberphysical systems, and Internet of Things, primarily because in such a complex network, it is impossible to collect all the information at the same point and each component of the network may not be willing to share its private information due to privacy issues.
Multiagent reinforcement learning (MARL) problems have recently received increasing attention. In general, MARL problems are investigated in settings that are either collaborative, competitive, or a mixture of the two. For collaborative MARL, the most rudimentary framework is the canonical multiagent Markov decision process [5, 6], where the agents share a common reward function that is determined by the joint actions of all agents. Another notable framework for collaborative MARL is the team Markov game model, also with a shared reward function among agents [7, 8]. These two frameworks were then extended to the setting where agents are allowed to have heterogeneous reward functions [9, 3, 10, 11, 12], collaborating with the goal of maximizing the longterm return corresponding to the team averaged reward. In particular, these works focused on a fullydecentralized/distributed setting, where there exists no central controller to coordinate the agents to achieve the overall team goal. Instead, the agents are connected via a communication network and are only able to exchange information with the neighbors on the network. There is also an evergrowing number of works on MARL in competitive and mixed settings [13, 14, 15, 16], where most of the recent ones are empirical works without theoretical convergence guarantees. In this work, we focus on the decentralized/distributed and collaborative MARL setting with networked agents as in [9, 3, 17].
Within this setting, the work of [3] proposed the first fully distributed actorcritic algorithm. The algorithm allows the agents to exchange information over a communication network with possibly sparse connectivity at each agent, which improves the scalability of the multiagent model with a high population of agents and thus tackles one of the longstanding challenges in general MARL problems [18]. A detailed comparison of the problem setting to the other existing ones on multiagent and collaborative MARL is provided in [3].
A possible communication issue of the algorithms proposed in [3] is that they require each agent to transmit the entire vector of its estimate of $\omega $ to its neighbors at each time. Such communicationcostly algorithms may not be possible to secure in some learning applications, for example, when the size of $\omega $ is very large but each agent has limited communication capacity. In this paper, we propose an approach in which each agent broadcasts only one (scaled) entry of its estimate of $\omega $, thus significantly reducing communication costs at each iteration. The algorithms in [3] more or less rely on doubly stochastic matrices, which implicitly requires bidirectional communication between each pair of neighboring agents. This requirement restricts the applications of the algorithm in scenarios with possibly unidirectional communication. To get around this limitation, we propose a variant using the idea of pushsum [19] with which each agent only needs to transmit two scalarvalued variables at each time, and in particular, each agent can independently determine which entry of the estimate of $\omega $ to transmit. For simplicity, we take a randomized way and show that with this very cheap communication scheme, the algorithm can solve the distributed reinforcement learning problem for any strongly connected graph that depicts the communication relationships among the agents. Note that in parallel with this work, [20] appeared to be the first one that considers the communication efficiency in MARL and distributed RL in general. However, the setting in [20] assumes the existence of a central controller. In contrast, to the best of our knowledge, our work proposes the first communicationefficient MARL/distributed RL algorithm for networked agents.
The remaining part of this paper is organized as follows. Section II introduces the MARL problem and presents a communicationefficient variant of the algorithm in [3], which essentially requires bidirectional communication and transmission coordination between neighboring agents. Motivated by the restrictive requirements, Section III proposes a multiagent actorcritic algorithm for unidirectional communication, based on which, a modification is proposed in Section IV, yielding a communicationefficient algorithm over directed graphs without any transmission coordination among the agents. The paper ends with some concluding remarks in Section V, followed by an appendix.
II. Problem Formulation
In this section, we introduce the background and formulation of the MARL problem with networked agents.
A. Networked MultiAgent MDP
Consider a team of $N$ agents, denoted by $\mathcal{N}=\{1,2,\mathrm{\dots},N\}$, operating in a common environment. It is assumed that no central controller that can either collect rewards or make the decisions for the agents exists. In contrast, the agents are connected by a possibly timevarying communication network depicted by an undirected graph ${\mathcal{G}}_{t}=(\mathcal{N},{\mathcal{E}}_{t})$, where the set of communication links at time $t\in \mathbb{N}$ is denoted by ${\mathcal{E}}_{t}$. Then, a networked multiagent MDP model can be defined by a tuple $(\mathcal{S},{\{{\mathcal{A}}^{i}\}}_{i\in \mathcal{N}},P,{\{{R}^{i}\}}_{i\in \mathcal{N}},{\{{\mathcal{G}}_{t}\}}_{t\ge 0})$, where $\mathcal{S}$ is the state space shared by all the agents in $\mathcal{N}$, ${\mathcal{A}}^{i}$ is the action space of agent $i$, and ${\{{\mathcal{G}}_{t}\}}_{t\ge 0}$ is a sequence of timevarying communication networks. For each agent $i$, ${R}^{i}:\mathcal{S}\times \mathcal{A}\to \mathbb{R}$ is the local reward function, where $\mathcal{A}={\prod}_{i=1}^{N}{\mathcal{A}}^{i}$ is the joint action space. $P:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\to [0,1]$ denotes the state transition probability of the MDP. It is assumed throughout the paper that the states are globally observable and but the rewards are observed only locally.
The networked multiagent MDP evolves as follows. Each agent $i$ chooses its own action ${a}_{t}^{i}$ given state ${s}_{t}$ at time $t$, according to a local policy, i.e., the probability of choosing action ${a}^{i}$ at state $s$, ${\pi}^{i}:\mathcal{S}\times {\mathcal{A}}^{i}\to [0,1]$. Note that the joint policy of all agents, $\pi :\mathcal{S}\times \mathcal{A}\to [0,1]$, satisfies $\pi (s,a)={\prod}_{i\in \mathcal{N}}{\pi}^{i}(s,{a}^{i})$. Also, a reward ${r}_{t+1}^{i}$ is received by agent $i$ after executing the action. To make the search of the optimal joint policy tractable, we assume that the local policy is parameterized by ${\pi}_{{\theta}^{i}}^{i}$, where ${\theta}^{i}\in {\mathrm{\Theta}}^{i}$ is the parameter, and ${\mathrm{\Theta}}^{i}\subseteq {\mathbb{R}}^{{m}_{i}}$ is a compact set. The parameters are concatenated as $\theta ={[{({\theta}^{1})}^{\top},\mathrm{\cdots},{({\theta}^{N})}^{\top}]}^{\top}\in \mathrm{\Theta}$, where $\mathrm{\Theta}={\prod}_{i=1}^{N}{\mathrm{\Theta}}^{i}$. The joint policy is thus given by ${\pi}_{\theta}(s,a)={\prod}_{i\in \mathcal{N}}{\pi}_{{\theta}^{i}}^{i}(s,{a}_{i}).$ We first make a standard regularity assumption on the model and the policy parameterization.
Assumption 1.
For any $i\mathrm{\in}\mathrm{N}$, $s\mathrm{\in}\mathrm{S}$, and ${a}^{i}\mathrm{\in}{\mathrm{A}}^{i}$, the policy function ${\pi}_{{\theta}^{i}}^{i}\mathit{}\mathrm{(}s\mathrm{,}{a}^{i}\mathrm{)}\mathrm{>}\mathrm{0}$ for any ${\theta}^{i}\mathrm{\in}{\mathrm{\Theta}}^{i}$. Also, ${\pi}_{{\theta}^{i}}^{i}\mathit{}\mathrm{(}s\mathrm{,}{a}^{i}\mathrm{)}$ is continuously differentiable with respect to the parameter ${\theta}^{i}$ over ${\mathrm{\Theta}}^{i}$. In addition, for any $\theta \mathrm{\in}\mathrm{\Theta}$, let ${P}^{\theta}$ be the transition matrix of the Markov chain ${\mathrm{\{}{s}_{t}\mathrm{\}}}_{t\mathrm{\ge}\mathrm{0}}$ induced by policy ${\pi}_{\theta}$, that is, for any $s\mathrm{,}{s}^{\mathrm{\prime}}\mathrm{\in}\mathrm{S}$
${P}^{\theta}({s}^{\prime}s)={\displaystyle \sum _{a\in \mathcal{A}}}{\pi}_{\theta}(s,a)\cdot P({s}^{\prime}s,a).$  (1) 
We assume that the Markov chain ${\{{s}_{t}\}}_{t\ge 0}$ is irreducible and aperiodic under any ${\pi}_{\theta}$, with the stationary distribution denoted by ${d}_{\theta}$.
Assumption 1 has been made in the existing work on actorcritic (AC) algorithms with function approximation [21, 22]. It implies that the Markov chain of the stateaction pair ${\{({s}_{t},{a}_{t})\}}_{t\ge 0}$ has a stationary distribution ${d}_{\theta}(s)\cdot {\pi}_{\theta}(s,a)$ for any $s\in \mathcal{S}$ and $a\in \mathcal{A}$.
The objective of the agents is to collaboratively find a policy ${\pi}_{\theta}$ that maximizes the globally averaged longterm return over the network based solely on local information, namely,
$\underset{\theta}{\mathrm{max}}J(\theta )=$  $\mathrm{}\underset{T}{lim}{\displaystyle \frac{1}{T}}\mathbb{E}\left({\displaystyle \sum _{t=0}^{T1}}{\displaystyle \frac{1}{N}}{\displaystyle \sum _{i\in \mathcal{N}}}{r}_{t+1}^{i}\right)$  
$=$  $\mathrm{}{\displaystyle \sum _{s\in \mathcal{S},a\in \mathcal{A}}}{d}_{\theta}(s){\pi}_{\theta}(s,a)\cdot \overline{R}(s,a),$  (2) 
where $\overline{R}(s,a)={N}^{1}\cdot {\sum}_{i\in \mathcal{N}}{R}^{i}(s,a)$ is the globally averaged reward function. Let ${\overline{r}}_{t}={N}^{1}\cdot {\sum}_{i\in \mathcal{N}}{r}_{t}^{i}$; then, we have $\overline{R}(s,a)=\mathbb{E}[{\overline{r}}_{t+1}{s}_{t}=s,{a}_{t}=a]$. Thus, the global relative actionvalue function under policy ${\pi}_{\theta}$ can be defined accordingly as
$${Q}_{\theta}(s,a)=\sum _{t}\mathbb{E}[{\overline{r}}_{t+1}J(\theta ){s}_{0}=s,{a}_{0}=a,{\pi}_{\theta}],$$ 
and the global relative statevalue function ${V}_{\theta}(s)$ is defined as ${V}_{\theta}(s)={\sum}_{a\in \mathcal{A}}{\pi}_{\theta}(s,a){Q}_{\theta}(s,a)$. For simplicity, hereafter we will refer to ${V}_{\theta}$ and ${Q}_{\theta}$ as statevalue function and actionvalue function only. Furthermore, the advantage function can be defined as ${A}_{\theta}(s,a)={Q}_{\theta}(s,a){V}_{\theta}(s)$.
As the basis for developing multiagent actorcritic algorithms, the following policy gradient theorem was established in [3] for MARL.
Policy Gradient Theorem for MARL: [Theorem $3.1$ in [3]] For any $\theta \in \mathrm{\Theta}$ and any agent $i\in \mathcal{N}$, we define the local advantage function ${A}_{\theta}^{i}:\mathcal{S}\times \mathcal{A}\to \mathbb{R}$ as
${A}_{\theta}^{i}(s,a)=$  $\mathrm{}{Q}_{\theta}(s,a){\stackrel{~}{V}}_{\theta}^{i}(s,{a}^{i}),$  (3) 
where ${\stackrel{~}{V}}_{\theta}^{i}(s,{a}^{i})={\sum}_{{a}^{i}\in {\mathcal{A}}^{i}}{\pi}_{{\theta}^{i}}^{i}(s,{a}^{i})\cdot {Q}_{\theta}(s,{a}^{i},{a}^{i})$. We use ${a}^{i}$ to denote the actions of all agents except for $i$. Then, the gradient of $J(\theta )$ with respect to ${\theta}^{i}$ is given by
${\nabla}_{{\theta}^{i}}J(\theta )$  $={\mathbb{E}}_{s\sim {d}_{\theta},a\sim {\pi}_{\theta}}\left[{\nabla}_{{\theta}^{i}}\mathrm{log}{\pi}_{{\theta}^{i}}^{i}(s,{a}^{i})\cdot {A}_{\theta}(s,a)\right]$  
$={\mathbb{E}}_{s\sim {d}_{\theta},a\sim {\pi}_{\theta}}\left[{\nabla}_{{\theta}^{i}}\mathrm{log}{\pi}_{{\theta}^{i}}^{i}(s,{a}^{i})\cdot {A}_{\theta}^{i}(s,a)\right].$  (4) 
B. A Motivating Algorithm
As mentioned earlier, the distributed actorcritic algorithms proposed in [3] require each agent to transmit its estimate of $\omega $ to all its neighbors at each time, which can be communicationexpensive when the size of $\omega $ is very large. A natural idea to reduce the communication cost at each time step is to allow each agent to transmit only partial entries of its estimate vector to its neighbors at one time. Such an idea has been explored in distributed algorithms for solving linear algebraic equations [23] and finding a common fixed point among a family of nonlinear maps [24].
We first propose a communicationefficient algorithm based on the algorithm in [3] and then show its limitation in implementation, which serves as a motivating algorithm for our main contribution in the next sections.
The algorithm is based on the local advantage function ${A}_{\theta}^{i}$ defined in (3), which requires estimating the actionvalue function ${Q}_{\theta}$ of policy ${\pi}_{\theta}$. Consider $Q(\cdot ,\cdot ;\omega ):\mathcal{S}\times \mathcal{A}\to \mathbb{R}$, a family of functions parametrized by $\omega \in {\mathbb{R}}^{K}$, where $K\ll \mathcal{S}\cdot \mathcal{A}$. It is assumed that each agent $i$ maintains its own parameter ${\omega}^{i}$ and uses $Q(\cdot ,\cdot ;{\omega}^{i})$ as a local estimate of ${Q}_{\theta}$.
The algorithm consists of two steps, the actor step and the critic step. In the critic step, an update based on temporal difference (TD) learning is performed at each agent to estimate $Q(\cdot ,\cdot ;{\omega}^{i})$, followed by a linear combination of its neighbors’ parameter estimates. Different from Algorithm 1 in [3] in which the parameter sharing step is the consensus update for all $N$ agents’ entire vectors of their estimates of $\omega $, the algorithm here allows each agent to randomly transmit some entries of its estimate vector. For simplicity, suppose that each agent randomly picks one entry of its estimate vector at each time and then transmits the entry to its neighbors. Then, the critic step involves a consensus update for each entry $k$ of the estimate vector, depending on which agents transmit the corresponding entry at time $t$, which involves a weight matrix ${C}_{t}^{k}={[{c}_{t}^{k}(i,j)]}_{N\times N}$, where ${c}_{t}^{k}(i,j)$ is the weight on the $k$th entry transmitted from agent $j$ to agent $i$ at time $t$. Specifically, the critic step iterates as follows:
${\mu}_{t+1}^{i}$  $=(1{\beta}_{\omega ,t})\cdot {\mu}_{t}^{i}+{\beta}_{\omega ,t}\cdot {r}_{t+1}^{i},$  
${\stackrel{~}{\omega}}_{t}^{i}$  $={\omega}_{t}^{i}+{\beta}_{\omega ,t}\cdot {\delta}_{t}^{i}\cdot {\nabla}_{\omega}{Q}_{t}({\omega}_{t}^{i}),$  (5)  
${\omega}_{t+1}^{ik}$  $={\displaystyle \sum _{j\in \mathcal{N}}}{c}_{t}^{k}(i,j)\cdot {\stackrel{~}{\omega}}_{t}^{jk},k\in \{1,2,\mathrm{\dots},K\},$ 
where ${\omega}_{t}^{ik}$ and ${\stackrel{~}{\omega}}_{t}^{ik}$ denote the $k$th entry of ${\omega}_{t}^{i}$ and ${\stackrel{~}{\omega}}_{t}^{i}$, respectively, ${\mu}_{t}^{i}$ tracks the longterm return of agent $i$, ${\beta}_{\omega ,t}>0$ is the stepsize, ${Q}_{t}(\omega )=Q({s}_{t},{a}_{t};\omega )$ for any $\omega $, and the local actionvalue TDerror ${\delta}_{t}^{i}$ in (5) is defined as
${\delta}_{t}^{i}={r}_{t+1}^{i}{\mu}_{t}^{i}+{Q}_{t+1}({\omega}_{t}^{i}){Q}_{t}({\omega}_{t}^{i}).$  (6) 
The actor step is motivated by (A.) and is the same as that of Algorithm 1 in [3].
In the special case when all weight matrices ${C}_{t}^{k}={[{c}_{t}^{k}(i,j)]}_{N\times N}$ are the same, i.e., ${C}_{t}^{k}={C}_{t}={[{c}_{t}(i,j)]}_{N\times N}$, $k\in \{1,\mathrm{\dots},K\}$, the algorithm simplifies to Algorithm 1 in [3], i.e., all agents’ estimate vectors are transmitted simultaneously at each time. For this case, it has been shown in [3] that the algorithm will guarantee the convergence upon the following assumption on the weight matrix ${C}_{t}$.
Assumption 2.
The sequence of random matrices ${\mathrm{\{}{C}_{t}\mathrm{\}}}_{t\mathrm{\ge}\mathrm{0}}$ satisfies the following conditions:

1.
${C}_{t}$ is row stochastic and $\mathbb{E}({C}_{t})$ is column stochastic.^{1}^{1} 1 A nonnegative matrix is called row (or column) stochastic if all its row (or column) sums equal to one. There exists a constant $\eta >0$ such that ${c}_{t}(i,j)\ge \eta $ for any ${c}_{t}(i,j)>0$.

2.
${C}_{t}$ respects the communication graph ${\mathcal{G}}_{t}$, i.e., ${c}_{t}(i,j)=0$, if $(j,i)\notin {\mathcal{E}}_{t}$.

3.
The spectral norm of $\mathbb{E}[{C}_{t}^{\top}\cdot (I{\mathrm{\U0001d7cf\U0001d7cf}}^{\top}/N)\cdot {C}_{t}]$ is strictly less than one.

4.
Given the $\sigma $algebra generated by the random variable before time $t$, ${C}_{t}$ is conditionally independent of ${r}_{t+1}^{i}$ for any $i\in \mathcal{N}$.
Here ${\mathrm{\U0001d7cf}}_{N}$ denotes the $N$dimensional column vector whose entries are all equal to one.
We now consider the heterogeneous ${C}_{t}^{k}$ case proposed in our algorithm. Let $\omega ={[{({\omega}^{1})}^{\top},\mathrm{\dots},{({\omega}^{N})}^{\top}]}^{\top}$ and $\stackrel{~}{\omega}={[{({\stackrel{~}{\omega}}^{1})}^{\top},\mathrm{\dots},{({\stackrel{~}{\omega}}^{N})}^{\top}]}^{\top}$. Then, from (5), it can be verified that
$${\omega}_{t+1}={\overline{C}}_{t}\cdot {\stackrel{~}{\omega}}_{t},$$ 
where ${\overline{C}}_{t}={\sum}_{k=1}^{K}{C}_{t}^{k}\otimes ({e}_{k}{e}_{k}^{\top})\in {\mathbb{R}}^{NK\times NK}$ and ${e}_{k}$ is $k$th unit vector. More can be said.
Proposition 1.
If ${C}_{t}^{k}$ satisfies Assumption 2 for all $k\mathrm{\in}\mathrm{\{}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}K\mathrm{\}}$, then ${\overline{C}}_{t}$ does as well.
The proof of this proposition can be found in the appendix.
The above proposition implies that the proposed communicationefficient algorithm can also guarantee the convergence if each entry’s weight matrix ${C}_{t}^{k}$ satisfies Assumption 2.^{2}^{2} 2 The proof is essentially the same as that in [3]. However, satisfying the assumption leads to restriction in implementation. Specifically, there only exist three distributed approaches to implement a weight matrix satisfying the assumption – the Metropolis algorithm [25], pairwise gossiping [26], and broadcast gossip [27]. However, the broadcast gossip cannot always guarantee average consensus and thus cannot be applied to the distributed RL problem considered here, and both the Metropolis algorithm and pairwise gossiping require bidirectional communication between each pair of neighboring agents. Thus, the implementation of the above communicationefficient algorithm requires bidirectional communication, i.e., the communication graph ${\mathcal{G}}_{t}$ must be undirected. Moreover, when an agent $i$ receives the $k$th entry from its neighbor $j$’s estimate, agent $i$ must send its $k$th entry to agent $j$ as well. Since each agent randomly picks one of its entries, in the worst scenario when each agent picks a distinct entry, the above requirement may lead all the agents to transmit $N$ entries at one time. To get around this limitation, we will propose a multiagent actorcritic algorithm for directed graphs in the next section, and then modify it to be communicationefficient in Section IV.
III. MultiAgent ActorCritic over Directed Graphs
In this section, we propose a multiagent actorcritic algorithm for directed communication graphs by exploiting the idea of the pushsum protocol [28]. For simplicity, we focus on fixed graphs which are assumed to be strongly connected. The algorithms and their convergence results can be extended to timevarying directed graphs with a certain mild condition on joint strong connectivity.
Let each agent $i$ have control over a scalarvalued variable ${y}_{t}^{i}$ whose initial value ${y}_{0}^{i}=1$. The critic step of the algorithm iterates as follows:
${\mu}_{t+1}^{i}$  $=(1{\beta}_{\omega ,t})\cdot {\mu}_{t}^{i}+{\beta}_{\omega ,t}\cdot {r}_{t+1}^{i},$  
${\stackrel{~}{\omega}}_{t}^{i}$  $={\omega}_{t}^{i}+{\beta}_{\omega ,t}\cdot {\stackrel{~}{\delta}}_{t}^{i}\cdot {\nabla}_{z}{Q}_{t}({z}_{t}^{i}),$  
${\omega}_{t+1}^{i}$  $={\displaystyle \sum _{j\in \mathcal{N}}}b(i,j)\cdot {\stackrel{~}{\omega}}_{t}^{j},$  
${y}_{t+1}^{i}$  $={\displaystyle \sum _{j\in \mathcal{N}}}b(i,j)\cdot {y}_{t}^{j},$  (7)  
${z}_{t+1}^{i}$  $={\displaystyle \frac{{\omega}_{t+1}^{i}}{{y}_{t+1}^{i}}},$ 
where ${\mu}_{t}^{i}$ tracks the longterm average return of agent $i$, ${\beta}_{\omega ,t}>0$ is the stepsize, ${Q}_{t}(z)$ denotes $Q({s}_{t},{a}_{t};z)$ for any $z$, and $b(i,j)$ will be specified shortly. The local actionvalue TDerror ${\stackrel{~}{\delta}}_{t}^{i}$ in (7) is given by
${\stackrel{~}{\delta}}_{t}^{i}={r}_{t+1}^{i}{\mu}_{t}^{i}+{Q}_{t+1}({z}_{t}^{i}){Q}_{t}({z}_{t}^{i}).$  (8) 
As for the actor step, each agent $i$ improves its policy via
${\theta}_{t+1}^{i}={\theta}_{t}^{i}+{\beta}_{\theta ,t}\cdot {A}_{t}^{i}\cdot {\psi}_{t}^{i},$  (9) 
where ${\beta}_{\theta ,t}>0$ is the stepsize, and ${A}_{t}^{i}$ and ${\psi}_{t}^{i}$ are defined as
${A}_{t}^{i}$  $={Q}_{t}({z}_{t}^{i}){\displaystyle \sum _{{a}^{i}\in {\mathcal{A}}^{i}}}{\pi}_{{\theta}_{t}^{i}}^{i}({s}_{t},{a}^{i})\cdot Q({s}_{t},{a}^{i},{a}_{t}^{i};{z}_{t}^{i}),$  
${\psi}_{t}^{i}$  $={\nabla}_{{\theta}^{i}}\mathrm{log}{\pi}_{{\theta}_{t}^{i}}^{i}({s}_{t},{a}_{t}^{i}).$  (10) 
Let $\mathcal{G}=(\mathcal{N},\mathcal{E})$ be the underlying communication graph. Then, $b(i,j)$ is given as follows: For all $i,j\in \mathcal{N}$,
$b(i,j)=$  $\mathrm{}{(1+{d}_{j})}^{1},\text{if}(j,i)\in \mathcal{E},$  
$b(i,j)=$  $\mathrm{}0,\text{if}(j,i)\notin \mathcal{E},$ 
where ${d}_{j}$ is the number of outgoing neighbors^{3}^{3} 3 Suppose that $(i,j)$ is a directed edge in graph $\mathcal{G}$. We say that agent $i$ is an inneighbor of agent $j$, and that agent $j$ is an outneighbor of agent $i$. of agent $j$, or equivalently, the outdegree of vertex $j$ in $\mathcal{G}$. Let $B$ be the matrix whose $ij$th entry is $b(i,j)$. Then, it is easy to see that

1.
$B$ is column stochastic, and there exists a constant $\eta >0$ such that $b(i,j)\ge \eta $ for any $b(i,j)>0$;

2.
$B$ respects the communication graph $\mathcal{G}$;

3.
Given the $\sigma $algebra generated by the random variable before time $t$, $B$ is conditionally independent of ${r}_{t+1}^{i}$ for any $i\in \mathcal{N}$.
We impose the following assumptions for the actorcritic algorithm which are either mild or standard; see [3] for detailed discussions on these assumptions.
Assumption 3.
The instantaneous reward ${r}_{t}^{i}$ is uniformly bounded for any $i\mathrm{\in}\mathrm{N}$ and $t\mathrm{\ge}\mathrm{0}$.
Assumption 4.
The stepsizes ${\beta}_{\omega \mathrm{,}t}$ and ${\beta}_{\theta \mathrm{,}t}$ satisfy
$\sum _{t}}{\beta}_{\omega ,t}={\displaystyle \sum _{t}}{\beta}_{\theta ,t}=\mathrm{\infty},$  
$$ 
In addition, ${\beta}_{\theta \mathrm{,}t}\mathrm{=}o\mathit{}\mathrm{(}{\beta}_{\omega \mathrm{,}t}\mathrm{)}$, and ${\mathrm{lim}}_{t}\mathit{}{\beta}_{\omega \mathrm{,}t\mathrm{+}\mathrm{1}}\mathrm{\cdot}{\beta}_{\omega \mathrm{,}t}^{\mathrm{}\mathrm{1}}\mathrm{=}\mathrm{1}$.
Assumption 5.
For each agent $i$, the function $Q\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{;}z\mathrm{)}$ is parametrized as $Q\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{;}z\mathrm{)}\mathrm{=}{z}^{\mathrm{\top}}\mathit{}\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$, where $\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{=}{\mathrm{[}{\varphi}_{\mathrm{1}}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{,}\mathrm{\cdots}\mathrm{,}{\varphi}_{K}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{]}}^{\mathrm{\top}}\mathrm{\in}{\mathrm{R}}^{K}$ is the feature associated with $\mathrm{(}s\mathrm{,}a\mathrm{)}$. The feature vector $\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$ is uniformly bounded for any $s\mathrm{\in}\mathrm{S}\mathrm{,}a\mathrm{\in}\mathrm{A}$. Furthermore, the feature matrix $\mathrm{\Phi}\mathrm{\in}{\mathrm{R}}^{\mathrm{}\mathrm{S}\mathrm{}\mathrm{\cdot}\mathrm{}\mathrm{A}\mathrm{}\mathrm{\times}K}$ has full column rank, where the $k$th column of $\mathrm{\Phi}$ is ${\mathrm{[}{\varphi}_{k}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{,}s\mathrm{\in}\mathrm{S}\mathrm{,}a\mathrm{\in}\mathrm{A}\mathrm{]}}^{\mathrm{\top}}$ for any $k\mathrm{\in}\mathrm{[}K\mathrm{]}$. Also, for any $u\mathrm{\in}{\mathrm{R}}^{K}$, $\mathrm{\Phi}\mathit{}u\mathrm{\ne}{\mathrm{1}}_{K}$.
Assumption 6.
The update of the policy parameter ${\theta}_{t}^{i}$ includes a local projection operator, ${\mathrm{\Gamma}}^{i}\mathrm{:}{\mathrm{R}}^{{m}_{i}}\mathrm{\to}{\mathrm{\Theta}}^{i}\mathrm{\subset}{\mathrm{R}}^{{m}_{i}}$, that projects any ${\theta}_{t}^{i}$ onto the compact set ${\mathrm{\Theta}}^{i}$. Also, we assume that $\mathrm{\Theta}\mathrm{=}{\mathrm{\prod}}_{i\mathrm{=}\mathrm{1}}^{N}{\mathrm{\Theta}}^{i}$ is large enough to include at least one local minimum of $J\mathit{}\mathrm{(}\theta \mathrm{)}$.
For simplicity, we define ${P}^{\theta}({s}^{\prime},{a}^{\prime}s,a)=P({s}^{\prime}s,a){\pi}_{\theta}({s}^{\prime},{a}^{\prime})$^{4}^{4} 4 With slight abuse of notation, the expression ${P}^{\theta}$ has the same form as the transition probability matrix of the Markov chain ${\{{s}_{t}\}}_{t\ge 0}$ under policy ${\pi}_{\theta}$ (see the definition in (1)). These two matrices can be easily differentiated by the context., ${\mathrm{D}}_{\theta}^{s,a}=\mathrm{diag}[{d}_{\theta}(s)\cdot {\pi}_{\theta}(s,a),s\in \mathcal{S},a\in \mathcal{A}]$, and $\overline{R}={[\overline{R}(s,a),s\in \mathcal{S},a\in \mathcal{A}]}^{\top}\in {\mathbb{R}}^{\mathcal{S}\cdot \mathcal{A}}$. We then define the operator ${T}_{\theta}^{Q}:{\mathbb{R}}^{\mathcal{S}\cdot \mathcal{A}}\to {\mathbb{R}}^{\mathcal{S}\cdot \mathcal{A}}$ for any actionvalue vector $Q\in {\mathbb{R}}^{\mathcal{S}\cdot \mathcal{A}}$ as
${T}_{\theta}^{Q}(Q)=\overline{R}J(\theta )\cdot {\mathrm{\U0001d7cf}}_{\mathcal{S}\cdot \mathcal{A}}+{P}^{\theta}Q.$  (11) 
We also define the vector ${\widehat{\mathrm{\Gamma}}}^{i}(\cdot )$ as
$$  (12) 
for any $\theta \in \mathrm{\Theta}$ and $g:\mathrm{\Theta}\to {\mathbb{R}}^{{\sum}_{i\in \mathcal{N}}{m}_{i}}$ a continuous function. In case the limit above is not unique, ${\widehat{\mathrm{\Gamma}}}^{i}[g(\theta )]$ is defined as the set of all possible limit points of (12).
With the above notation, we have the convergence of the critic step (7) – (8) and actor step (9) – (10) given policy ${\pi}_{\theta}$ as follows.
Theorem 1.
Suppose that Assumptions 1 and 35 hold, and that communication graph $\mathrm{G}$ is strongly connected. Then, for any given policy ${\pi}_{\theta}$, with the sequences $\mathrm{\{}{\mu}_{t}^{i}\mathrm{\}}$ and $\mathrm{\{}{z}_{t}^{i}\mathrm{\}}$ generated from (7) and (8), we have ${\mathrm{lim}}_{t}\mathit{}{\mathrm{\sum}}_{i\mathrm{\in}\mathrm{N}}{\mu}_{t}^{i}\mathrm{\cdot}{N}^{\mathrm{}\mathrm{1}}\mathrm{=}J\mathit{}\mathrm{(}\theta \mathrm{)}$ and ${\mathrm{lim}}_{t}\mathit{}{z}_{t}^{i}\mathrm{=}{\omega}_{\theta}$ almost surely for any $i\mathrm{\in}\mathrm{N}$, where $J\mathit{}\mathrm{(}\theta \mathrm{)}$ is the globally averaged return as defined in (A.), and ${\omega}_{\theta}$ is the unique solution to
$${\mathrm{\Phi}}^{\top}{\mathrm{D}}_{\theta}^{s,a}\left[{T}_{\theta}^{Q}(\mathrm{\Phi}{\omega}_{\theta})\mathrm{\Phi}{\omega}_{\theta}\right]=0.$$ 
Suppose further that Assumption 6 holds. Then, the sequence $\mathrm{\{}{\theta}_{t}^{i}\mathrm{\}}$ obtained from (9) converges almost surely to a point in the set of the asymptotically stable equilibria of
$${\dot{\theta}}^{i}={\widehat{\mathrm{\Gamma}}}^{i}\left[{\mathbb{E}}_{{s}_{t}\sim {d}_{\theta},{a}_{t}\sim {\pi}_{\theta}}\left({A}_{t,\theta}^{i}\cdot {\psi}_{t,\theta}^{i}\right)\right],\forall i\in \mathcal{N}.$$ 
The algorithm in this section works for directed communication graphs, but still requires each agent to transmit its entire estimate vector. In the next section, we will modify the algorithm to significantly reduce the communication cost at each time. The theorem just stated is a special case of the theorem in the next section.
IV. A CommunicationEfficient Algorithm over Directed Graphs
In this section, we present a communicationefficient distributed actorcritic algorithm, modified from the algorithm in the last section, in which each agent can independently transmit one scaled entry of its state vector, and in total only transmit two scalars at each iteration.
In our communicationefficient algorithm, each agent $i$ randomly picks one of its entries, $k$, of its estimate vector at time $t$ with probability ${p}_{t}^{ik}$, and then transmit its scaled value to its outneighbors. For simplicity, we assume that ${p}_{t}^{ik}={p}^{ik}$ for all $t$, and that ${p}^{ik}>0$ and ${\sum}_{k=1}^{K}{p}^{ik}=1$. For each entry $k$, each agent $i$ has control over a scalarvalued variable ${y}_{t}^{ik}$ whose initial value is ${y}_{0}^{ik}=1$.
The critic step iterates as follows:
${\mu}_{t+1}^{i}$  $=(1{\beta}_{\omega ,t})\cdot {\mu}_{t}^{i}+{\beta}_{\omega ,t}\cdot {r}_{t+1}^{i},$  
${\stackrel{~}{\omega}}_{t}^{i}$  $={\omega}_{t}^{i}+{\beta}_{\omega ,t}\cdot {\stackrel{~}{\delta}}_{t}^{i}\cdot {\nabla}_{z}{Q}_{t}({z}_{t}^{i}),$  
${\omega}_{t+1}^{ik}$  $={\displaystyle \sum _{j\in \mathcal{N}}}{b}_{t}^{k}(i,j)\cdot {\stackrel{~}{\omega}}_{t}^{jk},k\in \{1,2,\mathrm{\dots},K\},$  
${y}_{t+1}^{ik}$  $={\displaystyle \sum _{j\in \mathcal{N}}}{b}_{t}^{k}(i,j)\cdot {y}_{t}^{jk},k\in \{1,2,\mathrm{\dots},K\},$  (13)  
${z}_{t+1}^{ik}$  $={\displaystyle \frac{{\omega}_{t+1}^{ik}}{{y}_{t+1}^{ik}}},$ 
where ${\omega}_{t}^{ik}$ and ${\stackrel{~}{\omega}}_{t}^{ik}$ denote the $k$th entry of ${\omega}_{t}^{i}$ and ${\stackrel{~}{\omega}}_{t}^{i}$, respectively, ${\mu}_{t}^{i}$ tracks the longterm return of agent $i$, ${\beta}_{\omega ,t}>0$ is the stepsize, ${Q}_{t}(z)=Q({s}_{t},{a}_{t};z)$ for any $z$, and ${b}_{t}^{k}(i,j)$ will be specified shortly. The local actionvalue TDerror ${\stackrel{~}{\delta}}_{t}^{i}$ in (13) is given by
${\stackrel{~}{\delta}}_{t}^{i}={r}_{t+1}^{i}{\mu}_{t}^{i}+{Q}_{t+1}({z}_{t}^{i}){Q}_{t}({z}_{t}^{i}).$  (14) 
As for the actor step, each agent $i$ improves its policy via
${\theta}_{t+1}^{i}={\theta}_{t}^{i}+{\beta}_{\theta ,t}\cdot {A}_{t}^{i}\cdot {\psi}_{t}^{i},$  (15) 
where ${\beta}_{\theta ,t}>0$ is the stepsize. Moreover, ${A}_{t}^{i}$ and ${\psi}_{t}^{i}$ are defined as
${A}_{t}^{i}$  $={Q}_{t}({z}_{t}^{i}){\displaystyle \sum _{{a}^{i}\in {\mathcal{A}}^{i}}}{\pi}_{{\theta}_{t}^{i}}^{i}({s}_{t},{a}^{i})\cdot Q({s}_{t},{a}^{i},{a}_{t}^{i};{z}_{t}^{i}),$  
${\psi}_{t}^{i}$  $={\nabla}_{{\theta}^{i}}\mathrm{log}{\pi}_{{\theta}_{t}^{i}}^{i}({s}_{t},{a}_{t}^{i}).$ 
Let $\mathcal{G}=(\mathcal{N},\mathcal{E})$ be the underlying communication graph. Then, ${b}_{t}^{k}(i,j)$ is given as follows. For all $i,j\in \mathcal{N}$, ${b}_{t}^{k}(i,j)={(1+{d}_{j})}^{1}$ if $(j,i)\in \mathcal{E}$ and agent $i$ picks $k$th entry at time $t$; otherwise, ${b}_{t}^{k}(i,j)=0$, where ${d}_{j}$ is the number of outgoing neighbors of agent $j$, or equivalently, the outdegree of vertex $j$ in $\mathcal{G}$. Let ${B}_{t}^{k}$ be the matrix whose $ij$th entry is ${b}_{t}^{k}(i,j)$. Then, it is easy to see that

1.
Each ${B}_{t}^{k}$ is column stochastic, and there exists a constant $\eta >0$ such that ${b}_{t}^{k}(i,j)\ge \eta $ for any ${b}_{t}^{k}(i,j)>0$;

2.
Each ${B}_{t}^{k}$ respects the communication relationship, i.e., ${b}_{t}^{k}(i,j)=0$ if agent $i$ does not receive information from agent $j$ at time $t$;

3.
Given the $\sigma $algebra generated by the random variable before time $t$, each ${B}_{t}^{k}$ is conditionally independent of ${r}_{t+1}^{i}$ for any $i\in \mathcal{N}$.
It is worth emphasizing that in our communicationefficient algorithm, each agent $j$ only needs to transmit two scalars, $\frac{{\stackrel{~}{\omega}}_{t}^{jk}}{1+{d}_{j}}$ and $\frac{{y}_{t}^{jk}}{1+{d}_{j}}$, at each iteration, and all the computations are in a fully distributed manner.
Theorem 2.
Suppose that Assumptions 1 and 35 hold, and that communication graph $\mathrm{G}$ is strongly connected. Then, for any given policy ${\pi}_{\theta}$, with the sequences $\mathrm{\{}{z}_{t}^{i}\mathrm{\}}$ and $\mathrm{\{}{\mu}_{t}^{i}\mathrm{\}}$ generated from (13) and (14), we have ${\mathrm{lim}}_{t}\mathit{}{\mathrm{\sum}}_{i\mathrm{\in}\mathrm{N}}{\mu}_{t}^{i}\mathrm{\cdot}{N}^{\mathrm{}\mathrm{1}}\mathrm{=}J\mathit{}\mathrm{(}\theta \mathrm{)}$ and ${\mathrm{lim}}_{t}\mathit{}{z}_{t}^{i}\mathrm{=}{\omega}_{\theta}$ almost surely for any $i\mathrm{\in}\mathrm{N}$, where $J\mathit{}\mathrm{(}\theta \mathrm{)}$ is the globally averaged return as defined in (A.), and ${\omega}_{\theta}$ is the unique solution to
$${\mathrm{\Phi}}^{\top}{\mathrm{D}}_{\theta}^{s,a}\left[{T}_{\theta}^{Q}(\mathrm{\Phi}{\omega}_{\theta})\mathrm{\Phi}{\omega}_{\theta}\right]=0.$$ 
Suppose further that Assumption 6 holds. Then, the sequence $\mathrm{\{}{\theta}_{t}^{i}\mathrm{\}}$ obtained from (15) converges almost surely to a point in the set of the asymptotically stable equilibria of
$${\dot{\theta}}^{i}={\widehat{\mathrm{\Gamma}}}^{i}\left[{\mathbb{E}}_{{s}_{t}\sim {d}_{\theta},{a}_{t}\sim {\pi}_{\theta}}\left({A}_{t,\theta}^{i}\cdot {\psi}_{t,\theta}^{i}\right)\right],\forall i\in \mathcal{N}.$$ 
To prove the above theorem, we need the following concepts and results.
Define the operator $\u27e8\cdot \u27e9:{\mathbb{R}}^{KN}\to {\mathbb{R}}^{K}$ as
$$\u27e8x\u27e9=\frac{1}{N}({\mathrm{\U0001d7cf}}_{N}^{\top}\otimes {I}_{K})x=\frac{1}{N}\sum _{i\in \mathcal{N}}{x}^{i}$$ 
for any $x={[{({x}^{1})}^{\top},\mathrm{\dots},{({x}^{N})}^{\top}]}^{\top}\in {\mathbb{R}}^{KN}$ with ${x}^{i}\in {\mathbb{R}}^{K}$ for all $i\in \mathcal{N}$.
Lemma 1.
For all $i\mathrm{\in}\mathrm{N}$, ${\mathrm{lim}}_{t\mathrm{\to}\mathrm{\infty}}\mathit{}{z}_{t}^{i}\mathrm{=}{\mathrm{lim}}_{t\mathrm{\to}\mathrm{\infty}}\mathit{}\mathrm{\u27e8}{\omega}_{t}\mathrm{\u27e9}\mathrm{=}{\mathrm{lim}}_{t\mathrm{\to}\mathrm{\infty}}\mathit{}\mathrm{\u27e8}{z}_{t}\mathrm{\u27e9}$.
Proof: From the update of each agent $i$ in (13),
${z}_{t+1}^{ik}$  $={\displaystyle \frac{{\omega}_{t+1}^{ik}}{{y}_{t+1}^{ik}}}$  
$={\displaystyle \frac{{\sum}_{j\in \mathcal{N}}{b}_{t}^{k}(i,j)\cdot {\omega}_{t}^{jk}+{b}_{t}^{k}(i,j)\cdot {\beta}_{\omega ,t}\cdot {\stackrel{~}{u}}_{t+1}^{jk}}{{\sum}_{j\in \mathcal{N}}{b}_{t}^{k}(i,j)\cdot {y}_{t}^{jk}}}$  
$={\displaystyle \sum _{j\in \mathcal{N}}}{\displaystyle \frac{{b}_{t}^{k}(i,j)\cdot {\omega}_{t}^{jk}+{b}_{t}^{k}(i,j)\cdot {\beta}_{\omega ,t}\cdot {\stackrel{~}{u}}_{t+1}^{jk}}{{\sum}_{l\in \mathcal{N}}{b}_{t}^{k}(i,l)\cdot {y}_{t}^{lk}}}$  
$={\displaystyle \sum _{j\in \mathcal{N}}}{\displaystyle \frac{\left({b}_{t}^{k}(i,j)\cdot {\omega}_{t}^{jk}+{b}_{t}^{k}(i,j)\cdot {\beta}_{\omega ,t}\cdot {\stackrel{~}{u}}_{t+1}^{jk}\right)/\left({b}_{t}^{k}(i,j)\cdot {y}_{t}^{jk}\right)}{1+{\sum}_{l\in \mathcal{N},l\ne j}{b}_{t}^{k}(i,l)\cdot {y}_{t}^{lk}/\left({b}_{t}(i,j)\cdot {y}_{t}^{jk}\right)}}$  
$={\displaystyle \sum _{j\in \mathcal{N}}}{\displaystyle \frac{{z}_{t}^{jk}+{\beta}_{\omega ,t}\cdot {\stackrel{~}{u}}_{t+1}^{jk}/{y}_{t}^{jk}}{1+{\sum}_{l\in \mathcal{N},l\ne j}{b}_{t}^{k}(i,l)\cdot {y}_{t}^{lk}/\left({b}_{t}^{k}(i,j)\cdot {y}_{t}^{jk}\right)}}$  
$={\displaystyle \sum _{j\in \mathcal{N}}}{s}_{t}^{k}(i,j)\cdot \left({z}_{t}^{jk}+{\beta}_{\omega ,t}\cdot {\u03f5}_{t}^{jk}\right),$ 
where
$${s}_{t}^{k}(i,j)={\left(1+\sum _{l\in \mathcal{N},l\ne j}{b}_{t}^{k}(i,l)\cdot {y}_{t}^{lk}/({b}_{t}^{k}(i,j)\cdot {y}_{t}^{jk})\right)}^{1},$$ 
${\stackrel{~}{u}}_{t+1}^{i}={\stackrel{~}{\delta}}_{t}^{i}\cdot {\nabla}_{z}Q({z}_{t}^{i})$, and ${\u03f5}_{t}^{ik}={\stackrel{~}{u}}_{t+1}^{ik}/{y}_{t}^{ik}$. Let ${S}_{t}={\sum}_{k=1}^{K}{S}_{t}^{k}\otimes ({e}_{k}{e}_{k}^{\top})$, ${z}_{t}={[{({z}_{t}^{1})}^{\top},\mathrm{\dots},{({z}_{t}^{N})}^{\top}]}^{\top}$, and ${\u03f5}_{t}={[{({\u03f5}_{t}^{1})}^{\top},\mathrm{\dots},{({\u03f5}_{t}^{N})}^{\top}]}^{\top}$, where ${S}_{t}^{k}={[{s}_{t}^{k}(i,j)]}_{N\times N}$, ${z}_{t}^{i}={[{z}_{t}^{i1},\mathrm{\dots},{z}_{t}^{ik}]}^{\top}$, and ${\u03f5}_{t}^{i}={[{\u03f5}_{t}^{i1},\mathrm{\dots},{\u03f5}_{t}^{iK}]}^{\top}$. Then, we have ${S}_{t}\cdot \mathrm{\U0001d7cf}=\mathrm{\U0001d7cf}$, and
$${z}_{t+1}={S}_{t}\cdot ({z}_{t}+{\beta}_{\omega ,t}\cdot {\u03f5}_{t}).$$  (16) 
From Lemma 1 (b) in [29], we know that ${lim}_{t\to \mathrm{\infty}}{z}_{t}^{i}={lim}_{t\to \mathrm{\infty}}\u27e8{\stackrel{~}{\omega}}_{t}\u27e9$. Since ${lim}_{t\to \mathrm{\infty}}{\beta}_{\omega ,t}\cdot {\delta}_{t}^{i}\cdot {\nabla}_{z}Q({z}_{t}^{i})={lim}_{t\to \mathrm{\infty}}{\beta}_{\omega ,t}\cdot {\delta}_{t}^{i}\cdot \varphi ({s}_{t},{a}_{t})=0$, we have ${lim}_{t\to \mathrm{\infty}}{\stackrel{~}{\omega}}_{t}^{i}={lim}_{t\to \mathrm{\infty}}{\omega}_{t}^{i}$. Thus, ${lim}_{t\to \mathrm{\infty}}{z}_{t}^{i}={lim}_{t\to \mathrm{\infty}}\u27e8{\stackrel{~}{\omega}}_{t}\u27e9={lim}_{t\to \mathrm{\infty}}\u27e8{\omega}_{t}\u27e9$, and ${lim}_{t\to \mathrm{\infty}}(\u27e8{z}_{t}\u27e9\u27e8{\omega}_{t}\u27e9)=0$.
Let ${y}_{t}={[{({y}_{t}^{1})}^{\top},\mathrm{\dots},{({y}_{t}^{N})}^{\top}]}^{\top}$, where ${y}_{t}^{i}={[{y}_{t}^{i1},\mathrm{\dots},{y}_{t}^{iK}]}^{\top}$. Then, we have ${y}_{t+1}={B}_{t}{y}_{t}$.
Lemma 2.
Suppose that communication graph $\mathrm{G}$ is strongly connected. There exists a constant $\alpha \mathrm{>}\mathrm{0}$ such that $\alpha \mathrm{\le}{y}_{t}^{i\mathit{}k}\mathrm{\le}N$ for any $i\mathrm{,}k\mathrm{,}t$ almost surely.
Proof: The existence of a uniform lower bound of ${y}_{t}^{ik}$ is a consequence of Lemma 3 in [30]. Since ${y}_{0}^{ik}=1$ for all $i,k$ and each ${B}_{t}^{k}$ is column stochastic, ${\sum}_{i=1}^{N}{y}_{t}^{ik}={\sum}_{i=1}^{N}{y}_{0}^{ik}=N$ for all $t$. It follows that $\alpha \le {y}_{t}^{ik}\le N$ for any $i,k,t$.
Lemma 3.
Proof: The proof of the lemma is the same as that of Lemma 5.2 in [3].
Lemma 4.
Proof: Recall that the update of $z$ is ${z}_{t+1}={S}_{t}\cdot ({z}_{t}+{\beta}_{\omega ,t}\cdot {\u03f5}_{t})$ given in (16). Let ${h}^{i}({z}_{t}^{i},{\mu}_{t}^{i},{y}_{t}^{i},{s}_{t},{a}_{t})=\mathbb{E}({\u03f5}_{t}^{i}{\mathcal{F}}_{t,1})$, ${M}_{t+1}^{i}={\u03f5}_{t}^{i}\mathbb{E}({\u03f5}_{t}^{i}{\mathcal{F}}_{t,1})$. Since the Markov chain ${\{({s}_{t},{a}_{t})\}}_{t\ge 0}$ is irreducible and aperidic given policy ${\pi}_{\theta}$, we have ${\overline{h}}^{i}({\omega}_{t}^{i},{\mu}_{t}^{i},{y}_{t}^{i})={\mathbb{E}}_{{s}_{t}\sim {d}_{\theta},{a}_{t}\sim {\pi}_{\theta}}[{h}^{i}({z}_{t}^{i},{\mu}_{t}^{i},{y}_{t}^{i},{s}_{t},{a}_{t})]={\mathrm{\Phi}}^{\top}{D}_{\theta}^{s,a}[{\stackrel{~}{R}}^{i}{\mathrm{\U0001d7cf}}_{N}\otimes {\stackrel{~}{\mu}}_{t}^{i}+({P}^{\theta}\mathrm{\Phi}\mathrm{\Phi}){\stackrel{~}{z}}_{t}^{i}]$, where ${\stackrel{~}{R}}^{i}={[\frac{{R}^{i1}}{{y}_{t}^{i1}},\mathrm{\dots},\frac{{R}^{iK}}{{y}_{t}^{iK}},\mathrm{\dots},\frac{{R}^{i((N1)K+1)}}{{y}_{t}^{i1}},\mathrm{\dots},\frac{{R}^{i(NK)}}{{y}_{t}^{iK}}]}^{\top}$, ${\stackrel{~}{\mu}}_{t}^{i}={[\frac{{\mu}_{t}^{i}}{{y}_{t}^{i1}},\mathrm{\dots},\frac{{\mu}_{t}^{i}}{{y}_{t}^{iK}}]}^{\top}$ and ${\stackrel{~}{z}}_{t}^{i}={[\frac{{z}_{t}^{i1}}{{y}_{t}^{i1}},\mathrm{\dots},\frac{{z}_{t}^{iK}}{{y}_{t}^{iK}}]}^{\top}$.
From Assumptions 3 and 5, and Lemmas 2 and 3, we know that $\exists {K}_{1},{K}_{2}>0$, s.t. ${\parallel \frac{{\varphi}_{t}^{k}}{{y}_{t}^{ik}}\parallel}_{\mathrm{\infty}}\le {K}_{1}$ and $\parallel {r}_{t+1}^{i}{\mu}_{t}^{i}\parallel \le {K}_{2},\forall k,i$. Thus, $\exists {K}_{3}>0$ such that ${\parallel {\overline{h}}^{i}({\omega}_{t}^{i},{\mu}_{t}^{i},{y}_{t}^{i}){h}^{i}({z}_{t}^{i},{\mu}_{t}^{i},{y}_{t}^{i},{s}_{t},{a}_{t})\parallel}^{2}\le {K}_{3}\cdot (1+{\parallel {\omega}_{t}\parallel}^{2})$. Moreover, we know ${h}^{i}({z}_{t}^{i},{\mu}_{t}^{i},{y}_{t}^{i},{s}_{t},{a}_{t})$ is Lipschitz continuous in ${z}_{t}^{i}$, and ${M}_{t+1}^{i}$ is martingale difference sequence. Since each ${B}_{t}^{k}$ is column stochastic, it has bounded norm. Thus, by Theorem A.2 in [3], it follows that ${z}_{t}^{i}$ is bounded almost surely.
Lemma 5.
Proof: From (13), we know that for each entry $k$ in ${\omega}_{t}$, ${\omega}_{t}^{k}={z}_{t}^{k}\cdot {y}_{t}^{k}$, $k\in \{1,\mathrm{\dots},NK\}$. Moreover, from Lemmas 2 and 4, ${z}_{t}$ and ${y}_{t}$ are bounded almost surely. Therefore, it is easy to show that ${\omega}_{t}$ is also bounded almost surely.
We are now in a position to prove Theorem 2.
Proof of Theorem 2: Let $\{{\mathcal{F}}_{t,1}\}$ be the filtration with $$. Let ${v}_{t}^{i}={[{\mu}_{t}^{i},{({\omega}_{t}^{i})}^{\top}]}^{\top}$ and ${v}_{t}={[{({v}_{t}^{1})}^{\top},\mathrm{\dots},{({v}_{t}^{N})}^{\top}]}^{\top}$. Then, the iteration of $\u27e8{\omega}_{t}\u27e9$ has the following form:
$\u27e8{\omega}_{t+1}\u27e9$  $={\displaystyle \frac{1}{N}}({\mathrm{\U0001d7cf}}_{N}^{\top}\otimes {I}_{K}){B}_{t}({\omega}_{t}+{\beta}_{\omega ,t}{\stackrel{~}{u}}_{t+1})$  
$={\displaystyle \frac{1}{N}}({\mathrm{\U0001d7cf}}_{N}^{\top}\otimes {I}_{K})({\omega}_{t}+{\beta}_{\omega ,t}{\stackrel{~}{u}}_{t+1})$  
$=\u27e8{\omega}_{t}+{\beta}_{\omega ,t}{\stackrel{~}{u}}_{t+1}\u27e9$  
$=\u27e8{\omega}_{t}\u27e9+{\beta}_{\omega ,t}\u27e8{\stackrel{~}{u}}_{t+1}\u27e9$  
$=\u27e8{\omega}_{t}\u27e9+{\beta}_{\omega ,t}\u27e8{\stackrel{~}{\delta}}_{t}\u27e9\cdot {\varphi}_{t}.$ 
Hence, the updates for $\u27e8{\omega}_{t}\u27e9$ and $\u27e8{\mu}_{t}\u27e9$ are
$\u27e8{\mu}_{t+1}\u27e9$  $=\u27e8{\mu}_{t}\u27e9+{\beta}_{\omega ,t}\cdot \mathbb{E}({\overline{r}}_{t+1}\u27e8{\mu}_{t}\u27e9{\mathcal{F}}_{t,1})+{\beta}_{\omega ,t}\cdot {\xi}_{t+1,1},$  (17)  
$\u27e8{\omega}_{t+1}\u27e9$  $=\u27e8{\omega}_{t}\u27e9+{\beta}_{\omega ,t}\cdot \mathbb{E}(\u27e8{\delta}_{t}\u27e9{\varphi}_{t}{\mathcal{F}}_{t,1})+{\beta}_{\omega ,t}\cdot {\xi}_{t+1,2}+{\beta}_{\omega ,t}\cdot {\gamma}_{t+1},$  (18) 
where ${\xi}_{t+1,1}={\overline{r}}_{t+1}\mathbb{E}({\overline{r}}_{t+1}\u27e8{\mu}_{t}\u27e9{\mathcal{F}}_{t,1})$, ${\xi}_{t+1,2}=\u27e8{\delta}_{t}\u27e9{\varphi}_{t}\mathbb{E}(\u27e8{\delta}_{t}\u27e9{\varphi}_{t}{\mathcal{F}}_{t,1})$, and ${\gamma}_{t+1}=\u27e8{\stackrel{~}{\delta}}_{t}\u27e9{\varphi}_{t}\u27e8{\delta}_{t}\u27e9{\varphi}_{t}$.
Note that $\mathbb{E}({\overline{r}}_{t+1}\u27e8{\mu}_{t}\u27e9{\mathcal{F}}_{t,1})$ is Lipschitz continuous in $\u27e8{\mu}_{t}\u27e9$, and that $\mathbb{E}(\u27e8{\delta}_{t}\u27e9{\varphi}_{t}{\mathcal{F}}_{t,1})$ is Lipschitz continuous in both $\u27e8{\omega}_{t}\u27e9$ and $\u27e8{\mu}_{t}\u27e9$. Moreover, ${\xi}_{t+1,1}$ and ${\xi}_{t+1,2}$ are martingale differences sequences. From Lemmas 2 and 5, $\{{\gamma}_{t}\}$ is a bounded random sequence with ${\gamma}_{t}\to 0$ as $t\to \mathrm{\infty}$ almost surely.
From Theorem B.2 in [3], the following ODE captures the asymptotic behavior of (17) and (18):
$\u27e8\dot{v}\u27e9$  $=\left[\begin{array}{cc}\u27e8\dot{\mu}\u27e9\hfill & \\ \u27e8\dot{\omega}\u27e9\hfill & \end{array}\right]$  
$=\left[\begin{array}{cc}\hfill 1\hfill & \hfill 0\hfill \\ \hfill {\mathrm{\Phi}}^{\top}{D}_{\theta}^{s,a}{\mathrm{\U0001d7cf}}_{NK}\hfill & \hfill {\mathrm{\Phi}}^{\top}{D}_{\theta}^{s,a}({P}^{\theta}{I}_{NK})\mathrm{\Phi}\hfill \end{array}\right]\left[\begin{array}{cc}\u27e8\mu \u27e9\hfill & \\ \u27e8\omega \u27e9\hfill & \end{array}\right]+\left[\begin{array}{cc}J(\theta )\hfill & \\ {\mathrm{\Phi}}^{\top}{D}_{\theta}^{s,a}\overline{R}\hfill & \end{array}\right]$  (19) 
where ${D}_{\theta}^{s,a}=\mathrm{diag}[{d}_{\theta}(s)\cdot {\pi}_{\theta}(s,a),s\in \mathcal{S},a\in \mathcal{A}].$ From the proof of Theorem 4.6 in [3], the ODE (19) is globally asymptotically stable and has its equilibrium satisfying
$$\{\begin{array}{c}\u27e8\mu \u27e9=J(\theta )\hfill \\ {\mathrm{\Phi}}^{\top}{D}_{\theta}^{s,a}[\overline{R}\u27e8\mu \u27e9{\mathrm{\U0001d7cf}}_{NK}+{P}^{\theta}\mathrm{\Phi}\u27e8\omega \u27e9\mathrm{\Phi}\u27e8\omega \u27e9]=0.\hfill \end{array}$$  (20) 
Note that the solution for $\u27e8\mu \u27e9$ at equilibrium is $J(\theta )$, and that the solution for $\u27e8\omega \u27e9$ has the form ${\omega}_{\theta}+lv$ with any $l\in \mathbb{R}$ and $v\in {\mathbb{R}}^{K}$ such that $\varphi v={\mathrm{\U0001d7cf}}_{K}$, where ${\omega}_{\theta}$ follows that ${\mathrm{\Phi}}^{\top}{\mathrm{D}}_{\theta}^{s,a}\left[{T}_{\theta}^{Q}(\mathrm{\Phi}{\omega}_{\theta})\mathrm{\Phi}{\omega}_{\theta}\right]=0$. Moreover, $\varphi v\ne {\mathrm{\U0001d7cf}}_{K}$ by Assumption 5, so ${\omega}_{\theta}$ is the unique solution, which implies that ${lim}_{t}\u27e8{\omega}_{t}\u27e9={\omega}_{\theta}$. Combining the above facts with Lemma 1, we conclude that ${lim}_{t}{z}_{t}^{i}={\omega}_{\theta}$.
As for the actor step convergence, the proof is the same as that of Theorem 4.7 in [3].
V. Conclusions
This paper has proposed a communicationefficient distributed reinforcement learning algorithm. We have shown that the algorithm allows each agent to only transmit two scalarvalued variables, one of which is an independently selected entry of the agent’s estimate vector, at each time, and works for any strongly connected graph, which significantly reduces communication cost at one time compared with the algorithms in [3]. It is fairly straightforward to extend the algorithm and its convergence result to the case where each agent transmits more than one entry of its estimate vector. It is also fairly straightforward to extend the proposed algorithm to an asynchronous case without communication delays as was done in [31]. In the case when communication delays are taken into account, a modified version of the algorithm here is expected to solve the problem using the idea in [32]. Future directions of this work include comparison of total communication cost with the algorithms in [3], extension to timevarying communication graphs, and development of asynchronous versions of the algorithm.
VI. Appendix
Proof of Proposition 1: (1) Since all ${C}_{t}^{k}(t)$ satisfy condition 1) in Assumption 2, it follows that
${\overline{C}}_{t}\cdot {\mathrm{\U0001d7cf}}_{NK}$  $={\displaystyle \sum _{k=1}^{K}}({C}_{t}^{k}\otimes ({e}_{k}{e}_{k}^{\top}))\cdot ({\mathrm{\U0001d7cf}}_{N}\otimes {\mathrm{\U0001d7cf}}_{K})$  
$={\displaystyle \sum _{k=1}^{K}}({C}_{t}^{k}\cdot {\mathrm{\U0001d7cf}}_{N})\otimes ({e}_{k}{e}_{k}^{\top}\cdot {\mathrm{\U0001d7cf}}_{K})={\mathrm{\U0001d7cf}}_{NK},$  
${\mathrm{\U0001d7cf}}_{NK}^{\top}\mathbb{E}[{\overline{C}}_{t}]$  $={\displaystyle \sum _{k=1}^{K}}({\mathrm{\U0001d7cf}}_{N}^{\top}\otimes {\mathrm{\U0001d7cf}}_{K}^{\top})(\mathbb{E}[{C}_{t}^{k}]\otimes ({e}_{k}{e}_{k}^{\top}))$  
$={\displaystyle \sum _{k=1}^{K}}({\mathrm{\U0001d7cf}}_{N}^{\top}\cdot \mathbb{E}[{C}_{t}^{k}])\otimes ({\mathrm{\U0001d7cf}}_{K}^{\top}\cdot ({e}_{k}{e}_{k}^{\top}))={\mathrm{\U0001d7cf}}_{NK}^{\top}.$ 
From the definition of ${\overline{C}}_{t}$, for any entry $(i,j)$ of ${C}_{t}^{k}$, ${\overline{c}}_{t}((i1)K+k,(j1)K+k)={c}_{t}^{k}(i,j)$. Since for any ${c}_{t}^{k}(i,j)>0$, ${c}_{t}^{k}(i,j)\ge \eta ,\forall k\in \{1,\mathrm{\dots},K\},i,j\in \{1,\mathrm{\dots},N\}$, each entry of ${\overline{C}}_{t}$ also satisfies condition 1).
(2) From the definition of ${\overline{C}}_{t}$, since all ${C}_{t}^{k}$ satisfy condition 2) in Assumption 2, so does ${\overline{C}}_{t}$.
(3) First note that
$\mathbb{E}[{\overline{C}}_{t}^{\top}\cdot ({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N)\otimes {I}_{K}\cdot {\overline{C}}_{t}]$  
$=\mathbb{E}[{({\displaystyle \sum _{k=1}^{K}}{C}_{t}^{k}\otimes ({e}_{k}{e}_{k}^{\top}))}^{\top}\cdot ({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N)\otimes ({\displaystyle \sum _{k=1}^{K}}{e}_{k}{e}_{k}^{\top})\cdot ({\displaystyle \sum _{k=1}^{K}}{C}_{t}^{k}\otimes ({e}_{k}{e}_{k}^{\top}))]$  
$=\mathbb{E}[({\displaystyle \sum _{k=1}^{K}}{({C}_{t}^{k})}^{\top}\otimes ({e}_{k}{e}_{k}^{\top}))\cdot ({\displaystyle \sum _{k=1}^{K}}({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N)\otimes ({e}_{k}{e}_{k}^{\top}))\cdot ({\displaystyle \sum _{k=1}^{K}}{C}_{t}^{k}\otimes ({e}_{k}{e}_{k}^{\top}))]$  
$=\mathbb{E}[({\displaystyle \sum _{k=1}^{K}}({({C}_{t}^{k})}^{\top}\cdot ({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N)\cdot {C}_{t}^{k})\otimes ({e}_{k}{e}_{k}^{\top}))].$ 
There exists a permutation matrix $D$ such that
${D}^{\top}\cdot \mathbb{E}[({\displaystyle \sum _{k=1}^{K}}({({C}_{t}^{k})}^{\top}\cdot ({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N)\cdot {C}_{t}^{k})\otimes ({e}_{k}{e}_{k}^{\top}))]\cdot D$  
$=\mathbb{E}[\mathrm{diag}\{({({C}_{t}^{1})}^{\top}({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N){C}_{t}^{1}),\mathrm{\dots},({({C}_{t}^{K})}^{\top}({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N){C}_{t}^{K})\}].$ 
Thus, the spectral norm of $\mathbb{E}[({\sum}_{k=1}^{K}({({C}_{t}^{k})}^{\top}\cdot ({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N)\cdot {C}_{t}^{k})\otimes ({e}_{k}{e}_{k}^{\top}))]$ is the same as that of $\mathbb{E}[\mathrm{diag}\{({({C}_{t}^{1})}^{\top}({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N){C}_{t}^{1}),\mathrm{\dots},({({C}_{t}^{K})}^{\top}({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N){C}_{t}^{K})\}]$.
Since the spectral norm of
$\mathbb{E}[{({C}_{t}^{k})}^{\top}({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N){C}_{t}^{k})]$ is strictly less than one for all $k$, so is the spectral norm of
$$\mathbb{E}[\mathrm{diag}\{({({C}_{t}^{1})}^{\top}({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N){C}_{t}^{1}),\mathrm{\dots},({({C}_{t}^{K})}^{\top}({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N){C}_{t}^{K})\}].$$ 
Thus, the spectral norm of $\mathbb{E}[{\overline{C}}_{t}^{\top}\cdot ({I}_{N}{\mathrm{\U0001d7cf}}_{N}{\mathrm{\U0001d7cf}}_{N}^{\top}/N)\otimes {I}_{K}\cdot {\overline{C}}_{t}]$ is strictly less than one.
(4) From the definition of ${\overline{C}}_{t}$, it is easy to see that ${\overline{C}}_{t}$ is conditionally independent of ${r}_{t+1}^{i}$ for any $i$ as all ${C}_{t}^{k}(t)$ are. This completes the proof.
References
 [1] G. Mateos, J.A. Bazerque, and G.B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 52(10):5262–5276, 2010.
 [2] R.K. Kolla, K. Jagannathan, and A. Gopalan. Collaborative learning of stochastic bandits over a social network. IEEE Transactions on Networking, 26(4):1782–1795, 2018.
 [3] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Fully decentralized multiagent reinforcement learning with networked agents. In International Conference on Machine Learning, pages 5872–5881, 2018.
 [4] Z. Jiang, A. Balu, C. Hegde, and S. Sarkar. Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems, pages 5904–5914, 2017.
 [5] C. Boutilier. Planning, learning and coordination in multiagent decision processes. In Conference on Theoretical Aspects of Rationality and Knowledge, pages 195–210, 1996.
 [6] M. Lauer and M. Riedmiller. An algorithm for distributed reinforcement learning in cooperative multiagent systems. In International Conference on Machine Learning, pages 535–542, 2000.
 [7] M.L. Littman. Valuefunction reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55–66, 2001.
 [8] X. Wang and T. Sandholm. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems, pages 1603–1610, 2003.
 [9] S. Kar, J.M. Moura, and H.V. Poor. QDlearning: A collaborative distributed strategy for multiagent reinforcement learning through consensus + innovations. IEEE Transactions on Signal Processing, 61(7):1848–1862, 2013.
 [10] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Finitesample analyses for fully decentralized multiagent reinforcement learning. arXiv preprint arXiv:1812.02783, 2018.
 [11] D. Lee, H. Yoon, V. Cichella, and N. Hovakimyan. Stochastic primaldual algorithm for distributed gradient temporal difference learning. arXiv preprint arXiv:1805.07918, 2018.
 [12] T.T. Doan, S.T. Maguluri, and J. Romberg. Convergence rates of distributed TD (0) with linear function approximation for multiagent reinforcement learning. arXiv preprint arXiv:1902.07393, 2019.
 [13] J. Hu and M.P. Wellman. Nash Qlearning for generalsum stochastic games. Journal of Machine Learning Research, 4(11):1039–1069, 2003.
 [14] J. Foerster, Y.M. Assael, N. Freitas, and S. Whiteson. Learning to communicate with deep multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
 [15] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multiagent actorcritic for mixed cooperativecompetitive environments. arXiv preprint arXiv:1706.02275, 2017.
 [16] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multitask multiagent reinforcement learning under partial observability. In International Conference on Machine Learning, pages 2681–2690, 2017.
 [17] K. Zhang, Z. Yang, and T. Başar. Networked multiagent reinforcement learning in continuous spaces. In Proceedings of IEEE Conference on Decision and Control, pages 2771–2776. IEEE, 2018.
 [18] Y. Shoham, R. Powers, and T. Grenager. Multiagent reinforcement learning: A critical survey. Technical Report, 2003.
 [19] D. Kempe, A. Dobra, and J. Gehrke. Gossipbased computation of aggregate information. In Proceedings of the 44th IEEE Symposium on Foundations of Computer Science, pages 482–491, 2003.
 [20] T. Chen, K. Zhang, G. B. Giannakis, and T. Başar. Communicationefficient distributed reinforcement learning. arXiv preprint arXiv:1812.03239, 2018.
 [21] V.R. Konda and J.N. Tsitsiklis. Actorcritic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
 [22] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actorcritic algorithms. Automatica, 45(11):2471–2482, 2009.
 [23] X. Gao, J. Liu, and T. Başar. Stochastic communicationefficient distributed algorithms for solving linear algebraic equations. In Proceedings of the IEEE MultiConference on Systems and Control, pages 380–385, 2016.
 [24] D. Fullmer, L. Wang, and A.S. Morse. On the distributed computation of a common fixed point of a family of paracontractions. In Proceedings of IEEE Conference on Decision and Control, pages 2289–2293, 2017.
 [25] L. Xiao, S. Boyd, and S. Lall. A scheme for robust distributed sensor fusion based on average consensus. In International Symposium on Information Processing in Sensor Networks, pages 63–70, 2005.
 [26] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking, 14(SI):2508–2530, 2006.
 [27] T.C. Aysal, M.E. Yildiz, A.D. Sarwate, and A. Scaglione. Broadcast gossip algorithms for consensus. IEEE Transactions on Signal Processing, 57(7):2748–2761, 2009.
 [28] D. Kempe, A. Dobra, and J. Gehrke. Gossipbased computation of aggregate information. In Proceedings of IEEE Symposium on Foundations of Computer Science, pages 482–491, 2003.
 [29] A. Nedić and A. Olshevsky. Distributed optimization over timevarying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
 [30] P. Rezaienia, B. Gharesifard, T. Linder, and B. Touri. Convergence rate of pushsum algorithms on random graphs. In Proceedings of IEEE Conference on Decision and Control, pages 4218–4223, 2018.
 [31] J. Liu and A.S. Morse. Asynchronous distributed averaging using double linear iterations. In Proceedings of the American Control Conference, pages 6620–6625, 2012.
 [32] C.N. Hadjicostis and T. Charalambous. Average consensus in the presence of delays in directed graph topologies. IEEE Transactions on Automatic Control, 59(3):763–768, 2014.