Abstract
We consider the networked multiagent reinforcement learning (MARL) problemin a fully decentralized setting, where agents learn to coordinate to achievethe joint success. This problem is widely encountered in many areas includingtraffic control, distributed control, and smart grids. We assume that thereward function for each agent can be different and observed only locally bythe agent itself. Furthermore, each agent is located at a node of acommunication network and can exchanges information only with its neighbors.Using softmax temporal consistency and a decentralized optimization method, weobtain a principled and dataefficient iterative algorithm. In the first stepof each iteration, an agent computes its local policy and value gradients andthen updates only policy parameters. In the second step, the agent propagatesto its neighbors the messages based on its value function and then updates itsown value function. Hence we name the algorithm value propagation. We prove anonasymptotic convergence rate 1/T with the nonlinear function approximation.To the best of our knowledge, it is the first MARL algorithm with convergenceguarantee in the control, offpolicy and nonlinear function approximationsetting. We empirically demonstrate the effectiveness of our approach inexperiments.
Quick Read (beta)
Value Propagation for Decentralized Networked Deep Multiagent Reinforcement Learning
Abstract
We consider the networked multiagent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem is widely encountered in many areas including traffic control, distributed control, and smart grids. We assume each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency, we derive a primaldual decentralized optimization method and obtain a principled and dataefficient iterative algorithm named value propagation. We prove a nonasymptotic convergence rate of $\mathcal{O}(1/T)$ with nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with a convergence guarantee in the control, offpolicy, nonlinear function approximation, fully decentralized setting.
1 Introduction
Multiagent systems have applications in a wide range of areas such as robotics, traffic control, distributed control, telecommunications, and economics. For these areas, it is often difficult or simply impossible to predefine agents’ behaviour to achieve satisfactory results, and multiagent reinforcement learning (MARL) naturally arises (Bu et al., 2008; Tan, 1993). For example, ElTantawy et al. (2013) model a traffic signal control problem as a multiplayer stochastic game and solve it with MARL. MARL generalizes reinforcement learning by considering a set of agents (decision makers) sharing a common environment. However, multiagent reinforcement learning is a challenging problem since the agents interact with both the environment and each other. For instance, independent Qlearning—treating other agents as a part of the environment—often fails as the multiagent setting breaks the theoretical convergence guarantee of Qlearning and makes the learning process unstable (Tan, 1993). Rashid et al. (2018); Foerster et al. (2018); Lowe et al. (2017) alleviate such a problem using a centralized network (i.e., being centralized for training, but decentralized during execution.). Its communication pattern is illustrated in the left panel of Figure 1.
Despite the great success of (partially) centralized MARL approaches, there are various scenarios, such as sensor networks (Rabbat and Nowak, 2004) and intelligent transportation systems (Adler and Blue, 2002) , where a central agent does not exist or may be too expensive to use. In addition, privacy and security are requirements of many real world problems in multiagent system (also in many modern machine learning problems) (Abadi et al., 2016; Kurakin et al., 2016) . For instance, in Federated learning (McMahan et al., 2016), the learning task is solved by a lose federation of participating devices (agents) without the need to centrally store the data, which significantly reduces privacy and security risk by limiting the attack surface to only the device. In the agreement problem (DeGroot, 1974; Mo and Murray, 2017), a group of agents may want to reach consensus on a subject without leaking their individual goal or opinion to others. Obviously, centralized MARL violates privacy and security requirements. To this end, we and others have advocated the fully decentralized approaches, which are useful for many applications including unmanned vehicles (Fax and Murray, 2002), power grid (Callaway and Hiskens, 2011), and sensor networks (Cortes et al., 2004). For these approaches, we can use a network to model the interactions between agents (see the right panel of Figure 1). Particularly, each agent makes its own decision based on its local reward and messages received from their neighbors. Thus each agent preserves the privacy of its own goal and policy. At the same time, through the messagepassing all agents achieve consensus to maximize the averaged cumulative rewards over all agents; see Equation (4).
In this paper, we propose a new fully decentralized networked multiagent deep reinforcement learning algorithm.
Using softmax temporal consistency (Nachum et al., 2017; Dai et al., 2018)
to connect value and policy updates,
we derive a new twostep primaldual decentralized reinforcement learning algorithm inspired by a primal decentralized optimization method (Hong et al., 2017) ^{2}^{2}
2
The objective in Hong et al. (2017) is a primal optimization problem with constraint. Thus they introduce a Lagrange multiplier like method to solve it (so they call it primaldual method ). Our objective function is a primaldual optimization problem with constraint. .
In the first step of each iteration, each agent computes its local policy, value gradients and dual gradients and then updates only policy parameters. In the second step, each agent propagates to its neighbors the messages based on its value function (and dual function) and then updates its own value function. Hence we name the algorithm value propagation. It preserves the privacy in the sense that no individual reward function is required for the networkwide collaboration.
We approximate the local policy, value function and dual function of each agent by deep neural networks, which enables automatic feature generation and endtoend learning.
Contributions:

•
We propose the value propagation algorithm and prove that it converges with the rate $\mathcal{O}(1/T)$ even with the nonlinear deep neural network function approximation. To the best of our knowledge, it is the first deep MARL algorithm with nonasymptotic convergence guarantee. At the same time, value propagation can use offpolicy updates, making it data efficient. When it reduces to the single agent case, it provides a proof of (Dai et al., 2018) in the realistic setting; see remarks of algorithm 3.3 in Section 3.3.

•
The objective function in our problem is a primaldual decentralized optimization form (see (3.2)), while the objective function in (Hong et al., 2017) is a primal problem. When our method reduces to pure primal analysis, we extend (Hong et al., 2017) to the stochastic and biased gradient setting which may be of independent interest to the optimization community. In the practical implementation, we extend ADAM into the decentralized setting to accelerate training.
2 Preliminary
2.1 MDP
Markov Decision Process (MDP) can be described by a 5tuple ($\mathcal{S},\mathcal{A},\mathcal{R},\mathcal{P},\gamma $): $\mathcal{S}$ is the finite state space, $\mathcal{A}$ is the finite action space, $\mathcal{P}={(P({s}^{\prime}s,a))}_{s,{s}^{\prime}\in \mathcal{S},a\in \mathcal{A}}$ are the transition probabilities, $R={(R(s,a))}_{s,{s}^{\prime}\in \mathcal{S},a\in \mathcal{A}}$ are the realvalued immediate rewards and $\gamma \in (0,1)$ is the discount factor. A policy is used to select actions in the MDP. In general, the policy is stochastic and denoted by $\pi $, where $\pi ({s}_{t},{a}_{t})$ is the conditional probability density at ${a}_{t}$ associated with the policy. Define ${V}^{*}(s)={\mathrm{max}}_{\pi}\mathbb{E}[{\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}R({s}_{t},{a}_{t}){s}_{0}=s]$ to be the optimal value function. It is known that ${V}^{*}$ is the unique fixed point of the Bellman optimality operator,
$$V(s)=(\mathcal{T}V)(s):=\underset{a}{\mathrm{max}}R(s,a)+\gamma {\mathbb{E}}_{{s}^{\prime}s,a}[V({s}^{\prime})].$$ 
The optimal policy ${\pi}^{*}$ is related to ${V}^{*}$ by the following equation: ${\pi}^{*}(s,a)=\mathrm{arg}{\mathrm{max}}_{a}\{R(s,a)+\gamma {\mathbb{E}}_{{s}^{\prime}s,a}{V}^{*}({s}^{\prime})\}$
2.2 Softmax Temporal Consistency
Nachum et al. (2017) establish a connection between value and policy based reinforcement learning based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Particularly, the soft Bellman optimality is as follows,
$$\begin{array}{cc}\hfill {V}_{\lambda}(s)=\underset{\pi (s,\cdot )}{\mathrm{max}}& ({\mathbb{E}}_{a\sim \pi (s,\cdot )}(R(s,a)+\hfill \\ & \gamma {\mathbb{E}}_{{s}^{\prime}s,a}{V}_{\lambda}({s}^{\prime}))+\lambda H(\pi ,s)),\hfill \end{array}$$  (1) 
where $H(\pi ,s)={\sum}_{a\in \mathcal{A}}\pi (s,a)\mathrm{log}\pi (s,a)$ and $\lambda \ge 0$ controls the degree of regularization. When $\lambda =0$, above equation reduces to the standard Bellman optimality condition. It is not hard to derive that
$$\begin{array}{cc}\hfill {V}_{\lambda}& =(\mathcal{T}{V}_{\lambda})(s)\hfill \\ & :=\lambda \mathrm{log}\left(\sum _{a\in \mathcal{A}}\mathrm{exp}(\frac{R(s,a)+\gamma {\mathbb{E}}_{{s}^{\prime}s,a}{V}_{\lambda}({s}^{\prime})}{\lambda})\right),\hfill \end{array}$$  (2) 
which is a softmax function over $R(s,a)+\gamma {\mathbb{E}}_{{s}^{\prime}s,a}{V}_{\lambda}({s}^{\prime})$.
An important property of this soft Bellman operator is the called temporal consistency, which leads to the Path Consistency Learning.
Proposition 1.
Nachum et al. (2017) Assume $\lambda \mathrm{>}\mathrm{0}$. Let ${V}_{\lambda}^{\mathrm{*}}$ be the fixed point of (1) and ${\pi}_{\lambda}^{\mathrm{*}}$ be the corresponding policy that attains that maximum on the RHS of (1). Then, $\mathrm{(}{V}_{\lambda}^{\mathrm{*}}\mathrm{,}{\pi}_{\lambda}^{\mathrm{*}}\mathrm{)}$ is the unique $\mathrm{(}V\mathrm{,}\pi \mathrm{)}$ pair that satisfies the following equation for all $\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\in}\mathrm{S}\mathrm{\times}\mathrm{A}\mathrm{:}$
$$V(s)=R(s,a)+\gamma {\mathbb{E}}_{{s}^{\prime}s,a}V({s}^{\prime})\lambda \mathrm{log}\pi (s,a).$$ 
A straightforward way to apply temporal consistency is to optimize the following problem,
$$\underset{V,\pi}{\mathrm{min}}{E}_{s,a}{\left(R(s,a)+\gamma {\mathbb{E}}_{{s}^{\prime}s,a}V({s}^{\prime})\lambda \mathrm{log}\pi (s,a)V(s)\right)}^{2}.$$ 
Dai et al. (2018) get around the double sampling problem of above formulation by introduce a primaldual form
$$\begin{array}{cc}\hfill \underset{V,\pi}{\mathrm{min}}\underset{\rho}{\mathrm{max}}{\mathbb{E}}_{s,a,{s}^{\prime}}[& {(\delta (s,a,{s}^{\prime})V(s))}^{2}]\hfill \\ & \eta {\mathbb{E}}_{s,a,{s}^{\prime}}[{(\delta (s,a,{s}^{\prime})\rho (s,a))}^{2}],\hfill \end{array}$$  (3) 
where $\delta (s,a,{s}^{\prime})=R(s,a)+\gamma V({s}^{\prime})\lambda \mathrm{log}\pi (s,a)$, $0\le \eta \le 1$ controls the tradeoff between bias and variance.
Notation. We use $\parallel \cdot \parallel $ to denote the Euclidean norm over the vector. ${A}^{\prime}$ stands for the transpose of $A$. $\odot $ denotes the entrywise product between two vectors.
3 Value Propagation
In this section, we present our multiagent reinforcement learning algorithm, i.e., value propagation. To begin with, we extend the MDP model to the Networked Multiagent MDP model following the definition in (Zhang et al., 2018). Let $\mathcal{G}=(\mathcal{N},\mathcal{E})$ be an undirected graph with $\mathcal{N}=N$ agents (node). $\mathcal{E}$ represents the set of edges. $(i,j)\in \mathcal{E}$ means agent $i$ and $j$ can communicate with each other through this edge. A networked multiagent MDP is characterized by a tuple $(\mathcal{S},{\{{\mathcal{A}}^{i}\}}_{i\in \mathcal{N}},\mathcal{P},{\{{R}^{i}\}}_{i\in \mathcal{N}},\mathcal{G},\gamma )$: $\mathcal{S}$ is the global state space shared by all agents (It could be partially observed, i.e., each agent observes its own state ${S}^{i}$, see our experiment). ${\mathcal{A}}^{i}$ is the action space of agent $i$, $\mathcal{A}={\prod}_{i=1}^{N}{\mathcal{A}}^{i}$ is the joint action space, $\mathcal{P}$ is the transition probability, ${\mathcal{R}}^{i}$ denotes the local reward function of agent $i$. We assume rewards are observed only locally to preserve the privacy of the each agent’s goal. At each time step, agents observe ${s}_{t}$ and make the decision ${a}^{t}=({a}_{1}^{t},{a}_{2}^{t},\mathrm{\dots},{a}_{N}^{t})$. Then each agent just receives its own reward ${R}_{i}({s}_{t},{a}_{t})$, and the environment switches to the new state ${s}_{t+1}$ according to the transition probability. Furthermore, since each agent make the decisions independently, it is reasonable to assume that the policy $\pi (s,a)$ can be factorized, i.e., $\pi (s,a)={\prod}_{i=1}^{N}{\pi}^{i}(s,{a}^{i})$ (Zhang et al., 2018). We call our method fullydecentralized method, since reward is received locally, the action is executed locally by agent, critic (value function) are trained locally.
3.1 MultiAgent Softmax Temporal Consistency
The goal of the agents is to learn a policy that maximizes the longterm reward averaged over the agent, i.e.,
$$\mathbb{E}\sum _{t=0}^{\mathrm{\infty}}\frac{1}{N}\sum _{i=1}^{N}{\gamma}^{t}{R}_{i}({s}_{t},{a}_{t}).$$  (4) 
In the following, we adapt the temporal consistency into the multiagent version. Let ${V}_{\lambda}(s)=\mathbb{E}\left(\frac{1}{N}{\sum}_{i=1}^{N}{R}_{i}(s,a)+\gamma {\mathbb{E}}_{{s}^{\prime}s,a}{V}_{\lambda}({s}^{\prime})+\lambda H(\pi ,s)\right),$ ${V}_{\lambda}^{*}$ be the optimal value function and ${\pi}_{\lambda}^{*}(s,a)={\prod}_{i=1}^{N}{\pi}_{\lambda}^{i*}(s,{a}^{i})$ be the corresponding policy. Apply the soft temporal consistency, we obtain that for all $(s,a)\in \mathcal{S}\times \mathcal{A}$, $({V}_{\lambda}^{*},{\pi}_{\lambda}^{*})$ is the unique $(V,\pi )$ pair that satisfies
$$\begin{array}{c}\hfill V(s)=\frac{1}{N}\sum _{i=1}^{N}{R}_{i}(s,a)+\gamma {\mathbb{E}}_{{s}^{\prime}s,a}V({s}^{\prime})\lambda \sum _{i=1}^{N}\mathrm{log}{\pi}^{i}(s,{a}^{i}).\end{array}$$  (5) 
A optimization problem inspired by (5) would be
$$\underset{{\{{\pi}^{i}\}}_{i=1}^{N},V}{\mathrm{min}}\mathbb{E}{\left(V(s)\frac{1}{N}\sum _{i=1}^{N}{R}_{i}(s,a)\gamma {\mathbb{E}}_{{s}^{\prime}s,a}V({s}^{\prime})+\lambda \sum _{i=1}^{N}\mathrm{log}{\pi}^{i}(s,{a}^{i})\right)}^{2}.$$  (6) 
There are two potential issues of above formulation: First, due to the inner conditional expectation, it would require two independent samples to obtain the unbiased estimation of gradient of $V$ (Dann et al., 2014). Second, $V(s)$ is a global variable over the network, thus can not be updated in a decentralized way.
For the first issue, we introduce the primaldual form of (6) as that in (Dai et al., 2018). Using the fact that ${x}^{2}={\mathrm{max}}_{\nu}(2\nu x{\nu}^{2})$ and the interchangeability principle (Shapiro et al., 2009) we have,
$$\underset{V,{\{{\pi}^{i}\}}_{i=1}^{N}}{\mathrm{min}}\underset{\nu}{\mathrm{max}}2{\mathbb{E}}_{s,a,{s}^{\prime}}[\nu (s,a)(\frac{1}{N}\sum _{i=1}^{N}({R}_{i}(s,a)+\gamma V({s}^{\prime})V(s)\lambda N\mathrm{log}{\pi}^{i}(s,{a}^{i}))]{\mathbb{E}}_{s,a,s}[{\nu}^{2}(s,a)].$$ 
Change the variable $\nu (s,a)=\rho (s,a)V(s)$, the objective function becomes
$$\underset{V,{\{{\pi}^{i}\}}_{i=1}^{N}}{\mathrm{min}}\underset{\rho}{\mathrm{max}}{\mathbb{E}}_{s,a,{s}^{\prime}}[{\left(\frac{1}{N}\sum _{i=1}^{N}({\delta}_{i}(s,a,{s}^{\prime})V(s))\right)}^{2}]{\mathbb{E}}_{s,a,{s}^{\prime}}[{\left(\frac{1}{N}\sum _{i=1}^{N}({\delta}_{i}(s,a,{s}^{\prime})\rho (s,a))\right)}^{2}],$$  (7) 
where ${\delta}_{i}={R}_{i}(s,a)+\gamma V({s}^{\prime})\lambda N\mathrm{log}{\pi}^{i}(s,{a}^{i}).$
3.2 Decentralized Parametrization
So far the problem is still in a centralized form, and we now turn to reformulating it in a decentralized way. We assume that policy, value function, dual variable $\rho $ are all in the parametric function class. Particularly, each agent’s policy is ${\pi}^{i}(s,{a}^{i}):={\pi}_{{\theta}_{{\pi}^{i}}}(s,{a}^{i})$ and ${\pi}_{\theta}(s,a)={\prod}_{i=1}^{N}{\pi}_{{\theta}_{{\pi}^{i}}}(s,{a}^{i}).$ The value function ${V}_{{\theta}_{v}}(s)$ is characterized by the parameter ${\theta}_{v}$, while ${\theta}_{\rho}$ represents the parameter of $\rho (s,a)$. Similar to (Dai et al., 2018), we optimize a slightly different version from (7).
$\underset{{\theta}_{v},{\{{\theta}_{{\pi}^{i}}\}}_{i=1}^{N}}{\mathrm{min}}\underset{{\theta}_{\rho}}{\mathrm{max}}{\mathbb{E}}_{s,a,{s}^{\prime}}[{\left({\displaystyle \frac{1}{N}}{\displaystyle \sum _{i=1}^{N}}({\delta}_{i}(s,a,{s}^{\prime})V(s))\right)}^{2}]\eta {\mathbb{E}}_{s,a,{s}^{\prime}}[{\left({\displaystyle \frac{1}{N}}{\displaystyle \sum _{i=1}^{N}}({\delta}_{i}(s,a,{s}^{\prime})\rho (s,a))\right)}^{2}],$  (8) 
where $0\le \eta \le 1$ controls the bias and variance tradeoff. When $\eta =0$, it reduces to the pure primal form.
We now consider the second issue that $V(s)$ is a global variable. To address this problem, we introduce the local copy of the value function, i.e., ${V}_{i}(s)$ for each agent $i$. In the algorithm, we have a consensus update step, such that these local copies are the same, i.e., ${V}_{1}(s)={V}_{2}(s)=\mathrm{\dots}={V}_{N}(s)=V(s)$, or equivalently ${\theta}_{{v}^{1}}={\theta}_{{v}^{2}}=\mathrm{\dots}={\theta}_{{v}^{N}}$, where ${\theta}_{{v}^{i}}$ are parameter of ${V}_{i}$ respectively. Notice now in (8), there is a global dual variable $\rho $ in the primaldual form. Therefore, we also introduce the local copy of the dual variable, i.e., ${\rho}_{i}(s,a)$ to formulate it into the decentralized optimization problem. Now the final objective function we need to optimize is
$\underset{{\{{\theta}_{{v}^{i}},{\theta}_{{\pi}^{i}}\}}_{i=1}^{N}}{\mathrm{min}}$  $\underset{{\{{\theta}_{{\rho}^{i}}\}}_{i=1}^{N}}{\mathrm{max}}L({\theta}_{V},{\theta}_{\pi},{\theta}_{\rho})={\mathbb{E}}_{s,a,{s}^{\prime}}[{\left({\displaystyle \frac{1}{N}}{\displaystyle \sum _{i=1}^{N}}({\delta}_{i}(s,a,{s}^{\prime}){V}_{i}(s))\right)}^{2}]$  
$\eta {\mathbb{E}}_{s,a,{s}^{\prime}}[{\left({\displaystyle \frac{1}{N}}{\displaystyle \sum _{i=1}^{N}}({\delta}_{i}(s,a,{s}^{\prime}){\rho}_{i}(s,a))\right)}^{2}],$  
$s.t.{\theta}_{{v}^{1}}=,\mathrm{\dots},={\theta}_{{v}^{N}},{\theta}_{{\rho}^{1}}=,\mathrm{\dots},={\theta}_{{\rho}^{N}},$  (9) 
where ${\delta}_{i}={R}_{i}(s,a)+\gamma {V}_{i}({s}^{\prime})\lambda N\mathrm{log}{\pi}^{i}(s,{a}^{i}).$ We are now ready to present the value propagation algorithm. In the following, for notational simplicity, we assume the parameter of each agent is a scalar, i.e., ${\theta}_{{\rho}^{i}},{\theta}_{{\pi}^{i}},{\theta}_{{v}^{i}}\in R$. We pack the parameter together and slightly abuse the notation by writing ${\theta}_{\rho}={[{\theta}_{{\rho}^{1}},\mathrm{\dots},{\theta}_{{\rho}^{N}}]}^{\prime}$, ${\theta}_{\pi}={[{\theta}_{{\pi}^{1}},\mathrm{\dots},{\theta}_{{\pi}^{N}}]}^{\prime}$, ${\theta}_{V}={[{\theta}_{{v}^{1}},\mathrm{\dots},{\theta}_{{v}^{N}}]}^{\prime}.$ Similarly, we also pack the stochastic gradient $g({\theta}_{\rho})={[g({\theta}_{{\rho}^{1}}),\mathrm{\dots},g({\theta}_{{\rho}^{n}})]}^{\prime}$, $g({\theta}_{V})={[g({\theta}_{{v}^{1}}),\mathrm{\dots},g({\theta}_{{v}^{n}})]}^{\prime}$.
3.3 Value propagation algorithm
Solving (3.2) even without constraints is not an easy problem when both primal and dual parts are approximated by the deep neural networks. An ideal way is to optimize the inner dual problem and find the solution ${\theta}_{\rho}^{*}=\mathrm{arg}{\mathrm{max}}_{{\theta}_{\rho}}L({\theta}_{V},{\theta}_{\pi},{\theta}_{\rho})$, such that ${\theta}_{{\rho}^{1}}=\mathrm{\dots}={\theta}_{{\rho}^{N}}$. Then we can do the (decentralized) stochastic gradient decent to solve the primal problem.
$$\underset{{\{{\theta}_{{v}^{i}},{\theta}_{{\pi}^{i}}\}}_{i=1}^{N}}{\mathrm{min}}L({\theta}_{V},{\theta}_{\pi},{\theta}_{\rho}^{*})s.t.{\theta}_{{v}^{1}}=\mathrm{\dots}={\theta}_{{v}^{N}}.$$  (10) 
However in practice, one tricky issue is that we can not get the exact solution ${\theta}_{\rho}^{*}$ of the dual problem. Thus, we do the (decentralized) stochastic gradient for ${T}_{dual}$ steps in the dual problem and get an approximated solution ${\stackrel{~}{\theta}}_{\rho}$ in the Algorithm 3.3. In our analysis, we take the error $\epsilon $ generated from this inexact solution into the consideration and analyze its effect on the convergence. Particularly, since ${\nabla}_{{\theta}_{V}}L({\theta}_{V},{\theta}_{\pi},{\stackrel{~}{\theta}}_{\rho})\ne {\nabla}_{{\theta}_{V}}L({\theta}_{V},{\theta}_{\pi},{\theta}_{\rho}^{*})$, the primal gradient is biased and the results in (Dai et al., 2018; Hong et al., 2017) do not fit this problem.
In the dual update we do a consensus update ${\theta}_{\rho}^{t+1}=\frac{1}{2}{D}^{1}{L}^{+}{\theta}_{\rho}^{t}\frac{{\alpha}_{\rho}}{2}{D}^{1}{A}^{\prime}{\mu}_{\rho}^{t}+\frac{{\alpha}_{\rho}}{2}{D}^{1}g({\theta}_{\rho}^{t})$ using the stochastic gradient of each agent, where ${\mu}_{\rho}$ is some auxiliary variable to incorporate the communication, $D$ is the degree matrix, $A$ is the nodeedge incidence matrix, ${L}^{+}$ is signless graph Laplacian. We defer the detail definition and the derivation of this algorithm to Appendix A.1 and Appendix A.5 due to space limitation. After updating the dual parameters, we optimize the primal parameters ${\theta}_{v}$, ${\theta}_{\pi}$. Similarly, we use a minibatch data from the replay buffer and then do a consensus update on ${\theta}_{v}$. The same remarks on $\rho $ also hold for the primal parameter ${\theta}_{v}$. Notice here we do not need the consensus update on ${\theta}_{\pi}$, since each agent’s policy ${\pi}^{i}(s,{a}^{i})$ is different than each other. This update rule is adapted from a primal decentralized optimization algorithm (Hong et al., 2017). Notice even in the pure primal case, Hong et al. (2017) only consider the batch gradient case while our algorithm and analysis include the stochastic and biased gradient case. In practicals implementation, we consider the decentralized momentum method and multistep temporal consistency to accelerate the training; see details in Appendix A.2 and Appendix A.3.
Remarks on Algorithm 3.3. (1) In the single agent case, Dai et al. (2018) assume the dual problem can be exactly solved and thus they analyze a simple pure primal problem. However such assumption is unrealistic especially when the dual variable is represented by the deep neural network. Our multiagent analysis considers the inexact solution. This is much harder than that in (Dai et al., 2018), since now the primal gradient is biased. (2) The update of each agent just needs the information of the agent itself and its neighbors. See this from the definition of $D$, $A$, ${L}^{+}$ in the appendix. (3) The topology of the Graph $\mathcal{G}$ affects the convergence speed. In particular, the rate depends on ${\sigma}_{\mathrm{min}}({A}^{\prime}A)$ and ${\sigma}_{\mathrm{min}}(D)$, which are related to spectral gap of the network.
[h]
\[email protected]@algorithmic
\STATE Input: Environment ENV, learning rate ${\alpha}_{\pi}$, ${\alpha}_{v}$, ${\alpha}_{\rho}$, discount factor $\gamma $, number of step ${T}_{dual}$ to train dual parameter ${\theta}_{{\rho}^{i}}$, replay buffer capacity $B$, nodeedge incidence matrix $A\in {R}^{E\times N}$, degree matrix $D$, signless graph Laplacian ${L}^{+}$.
\STATEInitialization of ${\theta}_{{v}^{i}},{\theta}_{{\pi}^{i}},{\theta}_{{\rho}^{i}}$, ${\mu}_{\rho}^{0}=0$, ${\mu}_{V}^{0}=0$.
\FOR$t=1,\mathrm{\dots},T$
\STATE sample trajectory ${s}_{0:\tau}\sim \pi (s,a)={\prod}_{i=1}^{N}{\pi}^{i}(s,{a}^{i})$ and add it into the replay buffer.
1. Update the dual parameter ${\theta}_{{\rho}^{i}}$
Do following dual update ${T}_{dual}$ times:
Random sample a minibatch of transition $({s}_{t},{\{{a}_{t}^{i}\}}_{i=1}^{N},{s}_{t+1},{\{{r}_{t}^{i}\}}_{i=1}^{N})$ from the replay buffer.
agent $i=1$ to $n$
\STATECalculate the stochastic gradient $g({\theta}_{{\rho}^{i}}^{t})$ of $\eta {({\delta}_{i}({s}_{t},{a}_{t},{s}_{t+1}){\rho}_{i}({s}_{t},{a}_{t}))}^{2}$ w.r.t. ${\theta}_{{\rho}^{i}}^{t}$.
\ENDFOR
// Do consensus update on ${\theta}_{\rho}:={[{\theta}_{{\rho}^{1}},\mathrm{\dots},{\theta}_{{\rho}^{N}}]}^{\prime}$
${\theta}_{\rho}^{t+1}=\frac{1}{2}{D}^{1}{L}^{+}{\theta}_{\rho}^{t}\frac{{\alpha}_{\rho}}{2}{D}^{1}{A}^{\prime}{\mu}_{\rho}^{t}+\frac{{\alpha}_{\rho}}{2}{D}^{1}g({\theta}_{\rho}^{t}),{\mu}_{\rho}^{t+1}={\mu}_{\rho}^{t}+\frac{1}{{\alpha}_{\rho}}A{\theta}_{\rho}^{t+1}$
2. Update primal parameters ${\theta}_{{v}^{i}},{\theta}_{{\pi}^{i}}$
\STATERandom sample a minibatch of transition $({s}_{t},{\{{a}_{t}^{i}\}}_{i=1}^{N},{s}_{t+1},{\{{r}_{t}^{i}\}}_{i=1}^{N})$ from the replay buffer.
\FOR agent $i=1$ to $n$
\STATECalculate the stochastic gradient $g({\theta}_{{v}^{i}}^{t})$,$g({\theta}_{{\pi}^{i}}^{t})$ of ${({\delta}_{i}({s}_{t},{a}_{t},{s}_{t+1}){V}_{i}({s}_{t}))}^{2}\eta {({\delta}_{i}({s}_{t},{a}_{t},{s}_{t+1}){\rho}_{i}({s}_{t},{a}_{t}))}^{2}$, w.r.t. ${\theta}_{{v}^{i}}^{t}$, ${\theta}_{{\pi}^{i}}^{t}$
\ENDFOR
// Do gradient decent on ${\theta}_{{\pi}^{i}}$:
${\theta}_{{\pi}^{i}}^{t+1}={\theta}_{{\pi}^{i}}^{t}{\alpha}_{\pi}g({\theta}_{{\pi}^{i}}^{t})$ for each agent $i$.
// Do consensus update on ${\theta}_{V}:={[{\theta}_{{v}^{1}},\mathrm{\dots},{\theta}_{{v}^{N}}]}^{\prime}$ :
${\theta}_{V}^{t+1}=\frac{1}{2}{D}^{1}{L}^{+}{\theta}_{V}^{t}\frac{{\alpha}_{v}}{2}{D}^{1}{A}^{\prime}{\mu}_{V}^{t}\frac{{\alpha}_{v}}{2}{D}^{1}g({\theta}_{V}^{t}),{\mu}_{V}^{t+1}={\mu}_{V}^{t}+\frac{1}{{\alpha}_{v}}A{\theta}_{V}^{t+1}.$
\ENDFOR
4 Theoretical Result
In this section, we give the convergence result on Algorithm 3.3. We first make two mild assumptions on the function approximators $f(\theta )$ of ${V}_{i}(s)$, ${\pi}^{i}(s,{a}^{i})$, ${\rho}_{i}(s,a)$.
Assumption 1.
i) The function approximator $f(\theta )$ is differentiable and has Lipschitz continuous gradient, i.e., $\parallel \nabla f({\theta}_{1})\nabla f({\theta}_{2})\parallel \le L\parallel {\theta}_{1}{\theta}_{2}\parallel ,\forall {\theta}_{1},{\theta}_{2}\in {R}^{K}.$ This is commonly assumed in the nonconvex optimization. ii) The function approximator $f(\theta )$ is lower bounded. This can be easily satisfied when the parameter is bounded, i.e., $\parallel \theta \parallel \le C$ for some positive constant $C$.
In the following, we give the theoretical analysis for Algorithm 3.3 in the same setting of (Antos et al., 2008; Dai et al., 2018) where samples are prefixed and from one single $\beta $mixing offpolicy sample path. We denote $\widehat{L}({\theta}_{V},{\theta}_{\pi})={\mathrm{max}}_{{\theta}_{\rho}}L({\theta}_{V},{\theta}_{\pi},{\theta}_{\rho}),s.t.,{\theta}_{{\rho}^{1}}=,\mathrm{\dots},={\theta}_{{\rho}^{N}}$
Theorem 1.
Let the function approximators of ${V}_{i}\mathit{}\mathrm{(}s\mathrm{)}$, ${\pi}^{i}\mathit{}\mathrm{(}s\mathrm{,}{a}^{i}\mathrm{)}$ and ${\rho}_{i}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$ satisfy Assumption 1, snd denote the total training step be $T$. We solve the inner dual problem with a approximated solution ${\stackrel{\mathrm{~}}{\theta}}_{\rho}\mathrm{=}{\mathrm{(}{\stackrel{\mathrm{~}}{\theta}}_{{\rho}^{\mathrm{1}}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\stackrel{\mathrm{~}}{\theta}}_{{\rho}^{N}}\mathrm{)}}^{\mathrm{\prime}}$, such that $\mathrm{\parallel}{\mathrm{\nabla}}_{{\theta}_{V}}\mathit{}L\mathit{}\mathrm{(}{\theta}_{V}\mathrm{,}{\theta}_{\pi}\mathrm{,}{\stackrel{\mathrm{~}}{\theta}}_{\rho}\mathrm{)}\mathrm{}{\mathrm{\nabla}}_{{\theta}_{V}}\mathit{}L\mathit{}\mathrm{(}{\theta}_{V}\mathrm{,}{\theta}_{\pi}\mathrm{,}{\theta}_{\rho}^{\mathrm{*}}\mathrm{)}\mathrm{\parallel}\mathrm{\le}{c}_{\mathrm{1}}\mathrm{/}\sqrt{T}$, and $\mathrm{\parallel}{\mathrm{\nabla}}_{{\theta}_{\pi}}\mathit{}L\mathit{}\mathrm{(}{\theta}_{V}\mathrm{,}{\theta}_{\pi}\mathrm{,}{\stackrel{\mathrm{~}}{\theta}}_{\rho}\mathrm{)}\mathrm{}{\mathrm{\nabla}}_{{\theta}_{\pi}}\mathit{}L\mathit{}\mathrm{(}{\theta}_{V}\mathrm{,}{\theta}_{\pi}\mathrm{,}{\theta}_{\rho}^{\mathrm{*}}\mathrm{)}\mathrm{\parallel}\mathrm{\le}{c}_{\mathrm{2}}\mathrm{/}\sqrt{T}$. Assume the variance of the stochastic gradient $g\mathit{}\mathrm{(}{\theta}_{V}\mathrm{)}$, $g\mathit{}\mathrm{(}{\theta}_{\pi}\mathrm{)}$ and $g\mathit{}\mathrm{(}{\theta}_{\rho}\mathrm{)}$ (estimated by a single sample) are bounded by ${\sigma}^{\mathrm{2}}$, the size of the minibatch is $\sqrt{T}$, the step size ${\alpha}_{\pi}\mathrm{,}{\alpha}_{v}\mathrm{,}{\alpha}_{\rho}\mathrm{\propto}\frac{\mathrm{1}}{L}$. Then value propagation in Algorithm 3.3 converges to the stationary solution of $\widehat{L}\mathit{}\mathrm{(}{\theta}_{V}\mathrm{,}{\theta}_{\pi}\mathrm{)}$ with rate $\mathrm{O}\mathit{}\mathrm{(}\mathrm{1}\mathrm{/}T\mathrm{)}\mathrm{.}$
Remarks: (1) The convergence criteria and its dependence on the network structure are involved. We defer the definition of them to the proof section in the appendix (Equation (45)). (2) We require that the approximated dual solution ${\stackrel{~}{\theta}}_{\rho}$ are not far from ${\theta}_{\rho}^{*}$ such that the estimation of the primal gradient of ${\theta}_{v}$ and ${\theta}_{\pi}$ are not far from the true one (the distance is less than $\mathcal{O}(1/\sqrt{T})$). Once the inner dual problem is concave, we can get this approximated solution easily using vanilla decentralized stochastic gradient method after at most $T$ steps. If the dual problem is nonconvex, we still can show the dual problem converges to some stationary solution with rate $\mathcal{O}(1/T)$ by our proof. (3) In the theoretical analysis, the stochastic gradient estimated from the minibatch (rather than the estimation from a single sample ) is common in nonconvex analysis, see the work (Ghadimi and Lan, 2016). In practice, a minibatch of samples is commonly used in training deep neural network.
5 Related work
Among related work on MARL, the setting of Zhang et al. (2018) is close to ours, where the author proposed a fully decentralized multiagent ActorCritic algorithm to maximize the expected timeaverage reward ${lim}_{T\to \mathrm{\infty}}\frac{1}{T}\mathbb{E}{\sum}_{t=1}^{T}\frac{1}{n}{\sum}_{i=1}^{n}{r}_{i}^{t}$. They provide the asymptotic convergence analysis on the onpolicy and linear function approximation setting. In our work, we consider the discounted reward setup, i.e., equation (4). Our algorithm includes both onpolicy and offpolicy setting thus can exploit data more efficiently. Furthermore, we provide a convergence rate $\mathcal{O}(\frac{1}{T})$ in the nonlinear function approximation setting which is much stronger than the result in Zhang et al. (2018). wai2018multi also consider a decentralized MARL but in the policy evaluation setting, while our work is on the control setting, a harder problem. Littman (1994) proposed the framework of the Markov Game which can be applied to collaborative and competitive setting Lauer and Riedmiller (2000); Hu and Wellman (2003). These early work considered the tabular case thus can not apply to the real problem with large state space. Recent works Foerster et al. (2016, 2018); Rashid et al. (2018); Raileanu et al. (2018); Jiang et al. (2018) have exploited powerful deep learning and obtained some promising empirical results. However most of them lacks theoretical guarantees while our work provides convergence analysis. Arslan and Yüksel (2017) consider the case where the reward function of each agent is the same, which is different from our setting. In the multitask reinforcement learning, the reward function among different tasks can be different Teh et al. (2017), and each agent may learn to share information or transfer useful information from one task to another to accelerate learning. However, in these previous works, there are no interactions between agents as in our setting. Notice most of MARL work are in the fashion of centralized training and decentralized execution. In the training, they do not have the constraint on the communication, while our work has a network structure which is more realistic to train a large system as we mentioned. There are a plethora of works on the deep reinforcement learning, thus we just mention some of them. Goyal et al. (2019) consider the information bottleneck by leveraging the mutual information to seek out the decision state. Nachum et al. (2017) proposed a Path Consistency Learning algorithm which builds the connection between value and policy based reinforcement learning using a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Dai et al. (2018) alleviated the double sampling problem in Nachum et al. (2017) by introducing a dual variable which can tune the biasvariance tradeoffs. Our work can be thought as a multiagent version of above works. The challenge arises from how to diffuse the information of each local agent efficiently to the whole network while remains the convergence property of the original algorithm.
6 Experimental result
The goal of our experiment is twofold: To better understand the effect of each component in the proposed algorithm; and to evaluate efficiency of value propagation in the offpolicy setting. To this end, we first do an ablation study on a simple random MDP problem, we then evaluate the performance on the cooperative navigation task (Lowe et al., 2017). The settings of the experiment are similar to those in (Zhang et al., 2018). Some implementation details are deferred to Appendix A.4 due to space constraints.
6.1 Ablation Study
In this experiment, we test effect of several components of our algorithm such as the consensus update, dual formulation in a random MDP problem. Particularly we answer following three questions: (1) Whether an agent can get consensus through messagepassing in value propagation even when each agent just knows its local reward. (2) How much performance does the decentralized approach sacrifice comparing with centralized one? (3) What is the effect of the dual part in our formulation ($0\le \eta \le 1$ and $\eta =0$ corresponds to the pure primal form)?
We compare value propagation with the centralized PCL. The centralized PCL means that there is a central node to collect rewards of all agent, thus it can optimize the objective function (6) using the single agent PCL algorithm (Nachum et al., 2017; Dai et al., 2018). Ideally, value propagation should converges to the same long term reward with the one achieved by the centralized PCL. In the experiment, we consider a multiagent RL problem with $N=10$ and $N=20$ agents, where each agent has two actions. A discrete MDP is randomly generated with $\mathcal{S}=32$ states. The transition probabilities are distributed uniformly with a small additive constant to ensure ergodicity of the MDP, which is $\mathcal{P}({s}^{\prime}a,s)\propto {p}_{s{s}^{\prime}}^{a}+{10}^{5},{p}_{s{s}^{\prime}}^{a}\sim U[0,1]$. For each agent $i$ and each stateaction pair $(s,a)$, the reward ${R}_{i}(s,a)$ is uniformly sampled from $[0,4]$.
In the left panel of Figure 2, we verify that the value function ${v}_{i}(s)$ in value propagation reaches the consensus through messagepassing in the end of the training. Particularly, we randomly choose three agent $i$, $j$, $k$ and draw their value functions over 20 randomly picked states. It is easy to see that value functions ${v}_{i}(s)$, ${v}_{j}(s)$, ${v}_{k}(s)$ over these states are almost same. This is accomplished by the consensus update in value propagation. In the middle and right panel of Figure 2, we compare the result of value propagation with centralized PCL and evaluate the effect of the dual part of value propagation. Particularly, we pick $\eta =0,0.01,0.1,1$ in the experiment, where $\eta =0$ corresponds to the pure primal formulation. When $\eta $ is too large ($\eta =1$), the algorithm would have large variance while $\eta =0$ the algorithm has some bias. Thus value propagation with $\eta =0.1,0.01$ has better result. We also see that value propagation ($\eta =0.1,0.01$) and centralized PCL converge to almost the same value, although there is a gap between centralized and decentralized algorithm. The centralized PCL converges faster than value propagation, since it does not need time to diffuse the reward information over the network.
6.2 Cooperative Navigation task
The aim of this section is to demonstrate that the value propagation outperforms decentralized multiagent ActorCritc (MAAC)Zhang et al. (2018) and the Multiagent PCL without communication. We also test our algorithm when the input state of the actor is partially observed, although our theorem just includes the fully observed case. Here PCL without communication means each agents maintains its own estimation of policy ${\pi}^{i}(s,{a}^{i})$ and value function ${V}^{i}(s)$ but there is no communication Graph. Notice this is different from the centralized PCL in section 6.1, where centralized PCL has a central node to collect all reward information and thus do not need further communication. Remark that the original MAAC is designed for the averaged reward setting thus we adapt it into the discounted case to fit our setting. We test the value propagation in the environment of the Cooperative Navigation task Lowe et al. (2017), where agents need to reach a set of $L$ landmarks trough physical movement. We modify this environment to fit our setting. Each agent can observe the position of the landmarks and other agents. The reward is given when the agent reaches its own landmarks. The penalty is received if agents collide with any other agents. Since the position of Landmarks are different, the reward function of each agent is different. Here we assume the state is globally observable which means position of the landmarks and other agents are observable to each agent. In particular, we assume the environment is in a rectangular region with size $2\times 2$. There are $N=8$ or $N=16$ agents. Each agent has a single target landmark, i.e., $L=N$, which is randomly located in the region. Each agent has five actions which corresponds to going up, down, left, right with units 0.1 or staying at the position. The agent has high probability (0.95) to move in the direction following its action and go in other direction randomly otherwise. The length of each epoch is set to be 500. When the agent is close enough to the landmark, e.g., the distance is less than 0.1, we think it reaches the target and gets reward $+5$. When two agents are close to each other (with distance less than 0.1), we treat this case as a collision and a penalty $1$ is received for each agent. The state includes the position of the landmarks and agents. The communication graph and mixing matrix are generated as that in section 6.1 with connectivity ratio 4/N. The value function ${v}_{i}(s)$ is approximated by a twohiddenlayer neural network with Relu as the activation function where inputs are the state information. Each hiddenlayer has $40$ hidden units. The dual function $\rho (s,a)$ is also approximated by a twohiddenlayer neural network, where the only difference is that inputs are stateaction pairs (s,a). The policy is approximated by a onehiddenlayer neural network with Relu as the activation function. The number of the hidden units is $32$. The output is the softmax function to approximate ${\pi}^{i}(s,{a}^{i})$. When the input of the actor is partially observed, we denote actor i as ${\pi}^{i}({o}^{i},{a}^{i})$ and assume the agent $i$ can obtain the state information of nearest $l$ ($l$=3 and $4$ in middle and right panel of Figure 3 respectively) neighbors. In all experiments, we use the multistep version of value propagation and choose $k=4$. We choose $\gamma =0.95$, $\lambda =0.01$. The learning rate of Adam is chosen as 5e4 and ${\beta}_{1},{\beta}_{2}$ are default value in Adam optimizer. The setting of PCL without communication is exactly same with value propagation except the absence of communication network.
In the left panel of Figure 3, we see the value function ${v}_{i}(s)$ reaches consensus in value propagation. In the middle and right panel of Figure 3, we compare value propagation with PCL without communication and MAAC. In PCL without communication, each agent maintains its own policy, value function and dual variable, which is trained by the algorithm SBEED Dai et al. (2018) with $\eta =0.01$. Since there is no communication between agents, intuitively agents may have more collisions in the learning process than those in value propagation. Indeed, In the middle and right panel, we see value propagation learns the policy much faster than PCL without communication. We also observe that value propagation outperforms MAAC. One possible reason is that value propagation is an offpolicy method thus we can apply experience replay which exploits data more efficiently than the onpolicy method MAAC. We also test the performance of value propagation (result labeled as partial value propagation in the figure) when the state information of actor is partially observed. Since the agent has limited information, its performance is worse than the fully observed case. But it is better than the PCL without communication (fully observed state).
In the following, we consider a more realistic setting in the traffic. For each agent, we count the number of other agents within a radius of a agent as the penalty to mimic the congestion level in the real traffic. This setting is more challenging than the previous experiment where the penalty is just 0 or 1. Particularly, there are totally 10 agents and initialized positions of them are uniformly randomly picked in $[1.6,1.6]\times [1.6,1.6]$. The destination is set to be a random point with distance $1.0$ to its starting point. The collision radius is set to be 0.15. For instance, if there are 3 other agents appearing within the radius of agent $i$, the penalty of agent $i$ is $3$. Other settings such as neural networks, transition and termination condition are the same with previous experiment. In Figure 4, we present the result.
In the early stage of the training, the penalty of agents are large (around 4) since we consider the congestion level and agents are initialized relatively closer to each other. After about 500 episodes training, value propagation starts to learn the coordination and outperforms the PCL without communication. At the end of the training, the reward of PCL without communication is around 2 while value propagation is around 7.2 and 9. MAAC learns the policy slowly due to its onpolicy learning nature.
7 Conclusions
We have presented value propagation, a new principled and dataefficient algorithm for fully decentralized deep multiagent reinforcement learning. First, it can be trained and executed via message passing between networked agents, scalable for large realworld applications. Second, we have provided a theoretical convergence guarantee with deep neural approximation. Third, we have demonstrated its effectiveness empirically.
References
 Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
 Adler and Blue (2002) Jeffrey L Adler and Victor J Blue. A cooperative multiagent transportation management and route guidance system. Transportation Research Part C: Emerging Technologies, 10(56):433–454, 2002.
 Antos et al. (2008) András Antos, Csaba Szepesvári, and Rémi Munos. Learning nearoptimal policies with bellmanresidual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
 Arslan and Yüksel (2017) Gürdal Arslan and Serdar Yüksel. Decentralized qlearning for stochastic teams and games. IEEE Transactions on Automatic Control, 62(4):1545–1558, 2017.
 Boyd et al. (2006) Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE transactions on information theory, 52(6):2508–2530, 2006.
 Bu et al. (2008) Lucian Bu, Robert Babu, Bart De Schutter, et al. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
 Callaway and Hiskens (2011) Duncan S Callaway and Ian A Hiskens. Achieving controllability of electric loads. Proceedings of the IEEE, 99(1):184–199, 2011.
 Cattivelli et al. (2008) Federico S Cattivelli, Cassio G Lopes, and Ali H Sayed. Diffusion recursive leastsquares for distributed estimation over adaptive networks. IEEE Transactions on Signal Processing, 56(5):1865–1877, 2008.
 Cortes et al. (2004) Jorge Cortes, Sonia Martinez, Timur Karatas, and Francesco Bullo. Coverage control for mobile sensing networks. IEEE Transactions on robotics and Automation, 20(2):243–255, 2004.
 Dai et al. (2018) Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pages 1133–1142, 2018.
 Dann et al. (2014) Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883, 2014.
 DeGroot (1974) Morris H DeGroot. Reaching a consensus. Journal of the American Statistical Association, 69(345):118–121, 1974.
 Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 ElTantawy et al. (2013) Samah ElTantawy, Baher Abdulhai, and Hossam Abdelgawad. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (marlinatsc): methodology and largescale application on downtown toronto. IEEE Transactions on Intelligent Transportation Systems, 14(3):1140–1150, 2013.
 Fax and Murray (2002) J Alexander Fax and Richard M Murray. Information flow and cooperative control of vehicle formations. In IFAC World Congress, volume 22, 2002.
 Foerster et al. (2016) Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
 Foerster et al. (2018) Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multiagent policy gradients. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 Ghadimi and Lan (2016) Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(12):59–99, 2016.
 Goyal et al. (2019) Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Matthew Botvinick, Hugo Larochelle, Sergey Levine, and Yoshua Bengio. Infobot: Transfer and exploration via the information bottleneck. arXiv preprint arXiv:1901.10902, 2019.
 Hong (2016) Mingyi Hong. Decomposing linearly constrained nonconvex problems by a proximal primal dual approach: Algorithms, convergence, and applications. arXiv preprint arXiv:1604.00543, 2016.
 Hong et al. (2017) Mingyi Hong, Davood Hajinezhad, and MingMin Zhao. Proxpda: The proximal primaldual algorithm for fast distributed nonconvex optimization and learning over networks. In International Conference on Machine Learning, pages 1529–1538, 2017.
 Hu and Wellman (2003) Junling Hu and Michael P Wellman. Nash qlearning for generalsum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003.
 Jiang et al. (2018) Jiechuan Jiang, Chen Dun, and Zongqing Lu. Graph convolutional reinforcement learning for multiagent cooperation. arXiv preprint arXiv:1810.09202, 2018.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
 Lauer and Riedmiller (2000) Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in cooperative multiagent systems. In In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer, 2000.
 Littman (1994) Michael L Littman. Markov games as a framework for multiagent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.
 Lowe et al. (2017) Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multiagent actorcritic for mixed cooperativecompetitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
 McMahan et al. (2016) H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communicationefficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
 Mo and Murray (2017) Yilin Mo and Richard M Murray. Privacy preserving average consensus. IEEE Transactions on Automatic Control, 62(2):753–765, 2017.
 Nachum et al. (2017) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2775–2785, 2017.
 Rabbat and Nowak (2004) Michael Rabbat and Robert Nowak. Distributed optimization in sensor networks. In Proceedings of the 3rd international symposium on Information processing in sensor networks, pages 20–27. ACM, 2004.
 Raileanu et al. (2018) Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multiagent reinforcement learning. arXiv preprint arXiv:1802.09640, 2018.
 Rashid et al. (2018) Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multiagent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.
 Shapiro et al. (2009) Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.
 Shi et al. (2015) Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. Extra: An exact firstorder algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
 Tan (1993) Ming Tan. Multiagent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pages 330–337, 1993.
 Teh et al. (2017) Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.
 (39) T Tieleman and G Hinton. Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machine learning. Technical report, Technical Report. Available online: https://zh. coursera. org/learn ….
 Xiao et al. (2005) Lin Xiao, Stephen Boyd, and Sanjay Lall. A scheme for robust distributed sensor fusion based on average consensus. In Information Processing in Sensor Networks, 2005. IPSN 2005. Fourth International Symposium on, pages 63–70. IEEE, 2005.
 Zhang et al. (2018) Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar. Fully decentralized multiagent reinforcement learning with networked agents. International Conference on Machine Learning, 2018.
Appendix: Value Propagation for Decentralized Networked Deep Multiagent Reinforcement Learning
Appendix A
A.1 Topology of the Graph
Here, we explain the matrix in the Algorithm 3.3 which are closely related to the topology of the Graph, which is left from the main paper due to the limit of the space.

•
$D=diag[{d}_{1},\mathrm{\dots},{d}_{N}]$ is the degree matrix, with ${d}_{i}$ denoting the degree of node $i$.

•
$A$ is the nodeedge incidence matrix: if $e\in \mathcal{E}$ and it connects vertex i and j with $i>j$, then ${A}_{ev}=1$ if $v=i$, ${A}_{ev}=1$ if $v=j$ and ${A}_{ev}=0$ otherwise.

•
The signless incidence matrix $B:=A$, where the absolute value is taken for each component of $A$.

•
The signless graph Laplacian ${L}^{+}={B}^{T}B$. By definition ${L}^{+}(i,j)=0$ if $(i,j)\notin \mathcal{E}$. Notice the nonzeros element in $A$, ${L}^{+}$, the update just depends on each agent itself and its neighbor.
A.2 Practical Acceleration
The algorithm 3.3 trains the agent with vanilla gradient decent method with a extra consensus update. In practice, the adaptive momentum gradient methods including Adagrad Duchi et al. [2011], Rmsprop Tieleman and Hinton and Adam Kingma and Ba [2014] have much better performance in training the deep neural network. We adapt Adam in our setting, and propose algorithm A.2 which has better performance than algorithm 3.3 in practice.
[htb!]
\[email protected]@algorithmic
\STATE Input: Environment ENV, learning rate ${\beta}_{1},{\beta}_{2}\in [0,1)$, ${\alpha}_{t}$, discount factor $\gamma $, a mixing matrix $W$, number of step ${T}_{dual}$ to train dual parameter ${\theta}_{\rho}^{i}$, replay buffer capacity $B$.
\STATEInitialization of ${\theta}_{{v}^{i}},{\theta}_{{\pi}^{i}},{\theta}_{{\rho}^{i}}$, moment vectors ${m}_{{v}^{i}}^{0}={m}_{{\rho}^{i}}^{0}=0$, ${w}_{{v}^{i}}^{0}={w}_{{\rho}^{i}}^{0}=0$ .
\FOR$t=1,\mathrm{\dots},T$
\STATE sample trajectory ${s}_{0:\tau}\sim \pi (s,a)={\prod}_{i=1}^{N}{\pi}^{i}(s,{a}^{i})$ and add it into the replay buffer.
// Update the dual parameter ${\theta}_{{\rho}^{i}}$
Do following update ${T}_{dual}$ times:
\STATERandom sample a minibatch of transition $({s}_{t},{\{{a}_{t}^{i}\}}_{i=1}^{N},{s}_{t+1},{\{{r}_{t}^{i}\}}_{i=1}^{N})$ from the replay buffer.
agent $i=1$ to $n$
\STATECalculate the stochastic gradient $g({\theta}_{{\rho}^{i}}^{t})$ of $\eta {({\delta}_{i}({s}_{t},{a}_{t},{s}_{t+1}){\rho}_{i}({s}_{t},{a}_{t}))}^{2}$ w.r.t. ${\theta}_{{\rho}^{i}}^{t}$.
// update momentum parameters:
${m}_{{\rho}^{i}}^{t}={\beta}_{1}{m}_{{\rho}^{i}}^{t1}+(1{\beta}_{1})(g({\theta}_{{\rho}^{i}}^{t}))$
${w}_{{\rho}^{i}}^{t}={\beta}_{2}{w}_{{\rho}^{i}}^{t1}+(1{\beta}_{2})g({\theta}_{{\rho}^{i}}^{t})\odot g({\theta}_{{\rho}^{i}}^{t})$
\ENDFOR
// Do consensus update for each agent $i$
${\theta}_{{\rho}^{i}}^{t+\frac{1}{2}}={\sum}_{j=1}^{N}{[W]}_{ij}{\theta}_{{\rho}^{j}}^{t},$ ${\theta}_{{\rho}^{i}}^{t}={\theta}_{{\rho}^{i}}^{t+\frac{1}{2}}{\alpha}_{t}\frac{{m}_{{\rho}^{i}}^{t}}{\sqrt{{w}_{{\rho}^{i}}^{t}}}$
// End the update of dual problem
// Update primal parameters ${\theta}_{{v}^{i}},{\theta}_{{\pi}^{i}}$.
\STATERandom sample a minibatch of transition $({s}_{t},{\{{a}_{t}^{i}\}}_{i=1}^{N},{s}_{t+1},{\{{r}_{t}^{i}\}}_{i=1}^{N})$ from the replay buffer.
\FOR agent $i=1$ to $n$
\STATECalculate the stochastic gradient $g({\theta}_{{v}^{i}}^{t})$,$g({\theta}_{{\pi}^{i}}^{t})$ of ${({\delta}_{i}({s}_{t},{a}_{t},{s}_{t+1}){V}_{i}({s}_{t}))}^{2}\eta {({\delta}_{i}({s}_{t},{a}_{t},{s}_{t+1}){\rho}_{i}({s}_{t},{a}_{t}))}^{2}$, w.r.t. ${\theta}_{{v}^{i}}^{t}$, ${\theta}_{{\pi}^{i}}^{t}$
//update the momentum parameter:
${m}_{{v}^{i}}^{t}={\beta}_{1}{m}_{{v}^{i}}^{t1}+(1{\beta}_{1})g({\theta}_{{v}^{i}}^{t})$
${w}_{{v}^{i}}^{t}={\beta}_{2}{w}_{{v}^{i}}^{t1}+(1{\beta}_{2})g({\theta}_{{v}^{i}}^{t})\odot g({\theta}_{{v}^{i}}^{t})$
// Using Adam to update ${\theta}_{{\pi}^{i}}$ for each agent $i$.
// Do consensus update on ${\theta}_{{v}^{i}}$ for each agent $i$:
${\theta}_{{v}^{i}}^{t+\frac{1}{2}}={\sum}_{j=1}^{N}{[W]}_{ij}{\theta}_{{v}^{j}}^{t},$ ${\theta}_{{v}^{i}}^{t}={\theta}_{{v}^{i}}^{t+\frac{1}{2}}{\alpha}_{t}\frac{{m}_{{v}^{i}}^{t}}{\sqrt{{w}_{{v}^{i}}^{t}}}$
\ENDFOR\ENDFOR
Mixing Matrix: In Algorithm A.2, there is a mixing matrix $W\subset {R}^{N\times N}$ in the consensus update. As its name suggests, it mixes information of the agent and its neighbors. This nonnegative matrix $W$ need to satisfy the following condition.

•
$W$ needs to be doubly stochastic, i.e., ${W}^{T}\text{\U0001d7cf}=\text{\U0001d7cf}$ and $W\text{\U0001d7cf}=\text{\U0001d7cf}$.

•
$W$ respects the communication graph $\mathcal{G}$, i.e., $W(i,j)=0$ if $(i,j)\notin \mathcal{E}$.

•
The spectral norm of ${W}^{T}(I{\text{\U0001d7cf}\text{\U0001d7cf}}^{T}/N)W$ is strictly smaller than one.
Here is one particular choice of the mixing matrix $W$ used in our work which satisfies above requirement called Metropolis weights Xiao et al. [2005].
$$\begin{array}{c}\hfill W(i,j)=1+\mathrm{max}{[d(i),d(j)]}^{1},\forall (i,j)\in \mathcal{E},\\ \hfill W(i,i)=1\sum _{j\in NE(i)}W(i,j),\forall i\in \mathcal{N},\end{array}$$  (11) 
where $NE(i)=\{j\in \mathcal{N}:(i,j)\in \mathcal{E}\}$ is the set of neighbors of the agent $i$ and $d(i)=\mathcal{N}(i)$ is the degree of agent $i$. Such mixing matrix is widely used in decentralized and distributed optimization Boyd et al. [2006], Cattivelli et al. [2008]. The update rule of the momentum term in Algorithm A.2 is adapted from Adam. The consensus (communication) steps are ${\theta}_{{\rho}^{i}}^{t+\frac{1}{2}}={\sum}_{j=1}^{N}{[W]}_{ij}{\theta}_{{\rho}^{j}}^{t}$ and ${\theta}_{{v}^{i}}^{t+\frac{1}{2}}={\sum}_{j=1}^{N}{[W]}_{ij}{\theta}_{{v}^{j}}^{t}$.
A.3 Multistep Extension on value propagation
The temporal consistency can be extended to the multistep case Nachum et al. [2017], where the following equation holds
$$\begin{array}{c}\hfill V({s}_{0})=\sum _{t=0}^{k1}{\gamma}^{t}{\mathbb{E}}_{{s}_{t}{s}_{0},{a}_{0:t1}}[R({s}_{t},{a}_{t})\lambda \mathrm{log}\pi ({s}_{t},{a}_{t}^{i})]+{\gamma}^{k}{\mathbb{E}}_{{s}_{k}{s}_{0},{a}_{0:k1}}V({s}_{k}).\end{array}$$ 
Thus in the objective function (3.2), we can replace ${\delta}_{i}$ by ${\delta}_{i}({s}_{0:k},{a}_{0:k1})={\sum}_{t=0}^{k1}{\gamma}^{t}\left({R}_{i}({s}_{t},{a}_{t})\lambda N\mathrm{log}{\pi}^{i}({s}_{t},{a}_{t}^{i})\right)+{\gamma}^{k}{V}_{i}({s}_{k})$ and change the estimation of stochastic gradient correspondingly in Algorithm 3.3 and Algorithm A.2 to get the multistep version of vaue propagation . In practice, the performance of setting $k>1$ is better than $k=1$ which is also observed in single agent case Nachum et al. [2017], Dai et al. [2018]. We can tune $k$ for each application to get the best performance.
A.4 Implementation details of the experiments
Ablation Study
The value function ${v}_{i}(s)$ and dual variable ${\rho}_{i}(s,a)$ are approximated by two hiddenlayer neural network with Relu as the activation function where each hiddenlayer has $20$ hidden units. The policy of each agent is approximated by a one hiddenlayer neural network with Relu as the activation function where the number of the hidden units is $32$. The output is the softmax function to approximate ${\pi}^{i}(s,{a}^{i})$. The mixing matrix in Algorithm A.2 is selected as the Metropolis Weights in (11). The graph $\mathcal{G}$ is generated by randomly placing communication links among agents such that the connectivity ratio is $4/N$. We set $\gamma =0.9$, $\lambda =0.01$, learning rate $\alpha $=5e4. The choice of ${\beta}_{1}$, ${\beta}_{2}$ are the default value in Adam.
Cooperative Navigation task
The value function ${v}_{i}(s)$ is approximated by a twohiddenlayer neural network with Relu as the activation function where inputs are the state information. Each hiddenlayer has $40$ hidden units. The dual function $\rho (s,a)$ is also approximated by a twohiddenlayer neural network, where the only difference is that inputs are stateaction pairs (s,a). The policy is approximated by a onehiddenlayer neural network with Relu as the activation function. The number of the hidden units is $32$. The output is the softmax function to approximate ${\pi}^{i}(s,{a}^{i})$. In all experiments, we use the multistep version of value propagation and choose $k=4$. We choose $\gamma =0.95$, $\lambda =0.01$. The learning rate of Adam is chosen as 5e4 and ${\beta}_{1},{\beta}_{2}$ are default value in Adam optimizer. The setting of PCL without communication is exactly same with value propagation except the absence of communication network.
A.5 Consensus update in Algorithm 3.3
We now give details to derive the Consensus Update in Algorithm 3.3 with $\eta =1$ to ease the exposition. When $\eta \in [0,1)$, we just need to change variable and some notations, the result are almost same. Here we use the primal update as an example, the derivation of the dual update is the same.
In the main paper section 3, we have shown that when $\eta =1$, in the primal update, we basically solve following problem.
$$\begin{array}{cc}& \underset{{\{{\theta}_{{v}_{i}},{\theta}_{{\pi}^{i}}\}}_{i=1}^{N}}{\mathrm{min}}2{\mathbb{E}}_{s,a,s}[{\nu}^{*}(s,a)(\frac{1}{N}\sum _{i=1}^{N}({R}^{i}(s,a)+\gamma {V}_{i}({s}^{\prime}){V}_{i}(s)\lambda N\mathrm{log}{\pi}^{i}(s,a))]{\mathbb{E}}_{s,a,s}[{\nu}^{*2}(s,a)],\hfill \\ & s.t.,{\theta}_{{v}_{1}}=\mathrm{\dots}={\theta}_{{v}_{n}}.\hfill \end{array}$$  (12) 
here for simplicity we assume in the dual optimization, we have already find the optimal solution ${\nu}^{*}(s,a)$. It can be any approximated solution of $\stackrel{~}{\nu}(s,a)$ which does not affect the derivation of the update rule in primal optimization. In the later proof, we will show how this approximated solution affects the convergence rate.
When we optimize w.r.t. ${\theta}_{{v}^{i}}$, we basically we solve a nonconvex problem with the following form
$$\underset{x}{\mathrm{min}}f(x)=\sum _{i=1}^{N}{f}_{i}({x}_{i}),s.t.{x}_{1}=\mathrm{\dots}={x}_{N}$$  (13) 
Recall the definition of the nodeedge incidence matrix $A$: if $e\in \mathcal{E}$ and it connects vertex $i$ and $j$ with $i>j$, then ${A}_{ev}=1$ if $v=i$, ${A}_{ev}=1$ if $v=j$ and ${A}_{e}v=0$ otherwise. Thus by define $x={[{x}_{1},\mathrm{\dots},{x}_{N}]}^{\prime}$ we have a equivalent form of (13)
$$\underset{x}{\mathrm{min}}f(x)=\sum _{i=1}^{N}{f}_{i}({x}_{i}),s.t.,Ax=0$$  (14) 
Notice the update of ${\theta}_{{\pi}^{i}}$ is a special case of above formulation, since we do not have the constraint ${x}_{1}=,\mathrm{\dots},={x}_{N}$. Thus in the following, it suffice to analyze above formulation (14). We adapt the ProxPDA in Hong et al. [2017] to solve above problem. To keep the notation consistent with Hong et al. [2017], we consider a more general problem
$$\underset{x}{\mathrm{min}}f(x)=\sum _{i=1}^{N}{f}_{i}({x}_{i}),s.t.,Ax=b.$$ 
In the following we denote $\nabla f({x}^{t}):={[{({\nabla}_{{x}_{1}}f({x}_{1}))}^{T},\mathrm{\dots},{({\nabla}_{{x}_{N}}f({x}_{N}))}^{T}]}^{T}$ where the superscript ${}^{\prime}$ means transpose. We denote ${g}_{i}({x}_{i})$ as an estimator of ${\nabla}_{{x}_{i}}f({x}_{i})$ and $g(x)=[{g}_{1}({x}_{1}),\mathrm{\dots},{g}_{N}({x}_{N})]$.
The update rule of ProxPDA is
$${x}^{t+1}=\mathrm{arg}\underset{x}{\mathrm{min}}\u27e8g({x}^{t}),x{x}^{t}\u27e9+\u27e8{\mu}^{t},Axb\u27e9+\frac{\beta}{2}{\parallel Axb\parallel}^{2}+\frac{\beta}{2}{\parallel x{x}^{t}\parallel}_{{B}^{T}B}^{2}$$  (15) 
$${\mu}^{t+1}={\mu}^{t}+\beta (A{x}^{t+1}b)$$  (16) 
where $g({x}^{t})$ is an estimator of $\nabla f({x}^{t})$. The signed graph Laplacian matrix ${L}_{}$ is ${A}^{T}A$. Now we choose $B:=A$ as the signless incidence matrix. Using this choice of $B$, we have ${B}^{T}B={L}^{+}\in {\mathbb{R}}^{N\times N}$ which is the signless graph Laplacian whose $(i,i)$th diagonal entry is the degree of node $i$ and its $(i,j)$th entry is 1 if $e=(i,j)\in \mathcal{E}$, and 0 otherwise.
Thus
$$\begin{array}{cc}\hfill {x}^{t+1}=& \mathrm{arg}\underset{x}{\mathrm{min}}\u27e8g({x}^{t}),x{x}^{t}\u27e9+\u27e8{\mu}^{t},Axb\u27e9+\frac{\beta}{2}{x}^{T}{L}_{}x+\frac{\beta}{2}(x{x}^{t}){L}^{+}(x{x}^{t})\hfill \\ \hfill =& \mathrm{arg}\underset{x}{\mathrm{min}}\u27e8g({x}^{t}),x\u27e9+\u27e8{\mu}^{t},Axb\u27e9+\frac{\beta}{2}{x}^{T}({L}_{}+{L}^{+})x\beta {x}^{T}{L}^{+}{x}^{t}\hfill \\ \hfill =& \mathrm{arg}\underset{x}{\mathrm{min}}\u27e8g({x}^{t}),x\u27e9+\u27e8{\mu}^{t},Axb\u27e9+\beta {x}^{T}Dx\beta {x}^{T}{L}^{+}{x}^{t},\hfill \end{array}$$  (17) 
where $D=diag[{d}_{1},\mathrm{\dots},{d}_{N}]$ is the degree matrix, with ${d}_{i}$ denoting the degree of node $i$.
After simple algebra, we obtain
$${x}^{t+1}=\frac{1}{2}{D}^{1}{L}^{+}{x}^{t}\frac{1}{2\beta}{D}^{1}{A}^{T}{\mu}^{t}\frac{1}{2\beta}{D}^{1}g({x}^{t}),$$ 
which is the primal update rule of the consensus step in the algorithm 3.3 (notice here the stepsize is $1/\beta $)
Appendix B Convergence Proof of Value Propagation
B.1 Convergence on the primal update
In this section, we first give the convergence analysis of the value propagation (algorithm 3.3) on the primal update. To include the effected of the inexact solution of dual optimization problem, we denote $g({x}^{t})=\nabla f({x}^{t})+{\u03f5}_{t}$, where ${\u03f5}_{t}={\epsilon}_{t}+{\stackrel{~}{\epsilon}}_{t}$ is some error terms.

•
${\epsilon}_{t}$ is a zero mean random variable coming from the randomness of the stochastic gradient $g({x}^{t})$.

•
${\stackrel{~}{\epsilon}}_{t}$ comes from the approximated solution of $\stackrel{~}{\nu}$ in (12) or $\stackrel{~}{\rho}$ in (3.2) such that $\parallel {\nabla}_{{\theta}_{v}}L({\theta}_{V},{\theta}_{\pi},{\stackrel{~}{\theta}}_{\rho}){\nabla}_{{\theta}_{v}}L({\theta}_{V},{\theta}_{\pi},{\theta}_{\rho}^{*})\parallel \le {\stackrel{~}{\epsilon}}_{t}$ and $\parallel {\nabla}_{{\theta}_{\pi}}L({\theta}_{V},{\theta}_{\pi},{\stackrel{~}{\theta}}_{\rho}){\nabla}_{{\theta}_{\pi}}L({\theta}_{V},{\theta}_{\pi},{\theta}_{\rho}^{*})\parallel \le {\stackrel{~}{\epsilon}}_{t}.$
Before we begin the proof, we made some mild assumption on the function $f(x)$.
Assumption 2.
1. The function f(x) is differentiable and has Lipschitz continuous gradient, i.e.,
$$\parallel \nabla f(x)\nabla f(y)\parallel \le \parallel xy\parallel ,\forall x,y\in {\mathbb{R}}^{K}.$$ 
2. Further assume that ${A}^{T}A+{B}^{T}B\succcurlyeq I$. This assumption is always satisfied by our choice on A and B. We have ${A}^{T}A+{B}^{T}B\succcurlyeq D\succcurlyeq {\mathrm{min}}_{i}\{{d}_{i}\}I$
3. There exists a constant $\delta >0$ such that $\exists \underset{\xaf}{f}>\mathrm{\infty},s.t.,f(x)+\frac{\delta}{2}\parallel Axb{\parallel}^{2}\ge \underset{\xaf}{f},\forall x.$ This assumption is satisfied if we require the parameter space is bounded.
Lemma 1.
Suppose the assumption 2 is satisfied, we have following inequality holds
$$\frac{{\parallel {\mu}^{t+1}{\mu}^{t}\parallel}^{2}}{\beta}\le \frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}{\parallel {x}^{t}{x}^{t1}\parallel}^{2}+\frac{3}{\beta}{\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}+\frac{3\beta}{{\sigma}_{\mathrm{min}}}{\parallel {B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right)\parallel}^{2}$$  (18) 
Proof.
Using the optimality condition of (15), we obtain
$$\nabla f({x}^{t})+{\u03f5}^{t}+{A}^{T}{\mu}^{t}+\beta {A}^{T}(A{x}^{t+1}b)+\beta {B}^{T}B({x}^{t+1}{x}^{t})=0$$  (19) 
applying equation (16) we have
$${A}^{T}{\mu}^{t+1}=\nabla f({x}^{t})\beta {B}^{T}B({x}^{t+1}{x}^{t}).$$  (20) 
Note that from the fact that ${\mu}^{0}=0$, we have the variable lies in the column space of $A$.
$${\mu}^{r}=\beta \sum _{t=1}^{r}(A{x}^{t}b).$$ 
Let ${\sigma}_{\mathrm{min}}$ denote the smallest nonzero eigenvalue of ${A}^{T}A$, we have
$$\begin{array}{cc}& {\sigma}_{\mathrm{min}}^{1/2}\parallel {\mu}^{t+1}{\mu}^{t}\parallel \hfill \\ \hfill \le & \parallel A({\mu}^{t+1}{\mu}^{t})\parallel \hfill \\ \hfill \le & \parallel \nabla f({x}^{t}){\u03f5}_{t}\beta {B}^{T}B({x}^{t+1}{x}^{t})(\nabla f({x}^{t1}){\u03f5}_{t1}\beta {B}^{T}B({x}^{t}{x}^{t1}))\parallel \hfill \\ \hfill =& \parallel \nabla f({x}^{t1})\nabla f({x}^{t})+({\u03f5}_{t1}{\u03f5}_{t})\beta {B}^{T}B(({x}^{t+1}{x}^{t}({x}^{t}{x}^{t1})))\parallel .\hfill \end{array}$$  (21) 
Thus we have
$$\begin{array}{cc}& \frac{{\parallel {\mu}^{t+1}{\mu}^{t}\parallel}^{2}}{\beta}\hfill \\ & \le \frac{1}{\beta {\sigma}_{\mathrm{min}}^{1/2}}{\parallel \nabla f({x}^{t1})\nabla f({x}^{t})+({\u03f5}_{t1}{\u03f5}_{t})\beta {B}^{T}B(({x}^{t+1}{x}^{t}({x}^{t}{x}^{t1})))\parallel}^{2}\hfill \\ & \le \frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}{\parallel {x}^{t}{x}^{t1}\parallel}^{2}+\frac{3}{\beta}{\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}+\frac{3\beta}{{\sigma}_{\mathrm{min}}}{\parallel {B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right)\parallel}^{2},\hfill \end{array}$$  (22) 
where the second inequality holds from the fact that ${(a+b+c)}^{2}\le 3{a}^{2}+3{b}^{2}+3{c}^{2}$.
∎
Lemma 2.
Define ${L}_{\beta}\mathit{}\mathrm{(}{x}^{t}\mathrm{,}{\mu}^{t}\mathrm{)}\mathrm{=}f\mathit{}\mathrm{(}{x}^{t}\mathrm{)}\mathrm{+}\mathrm{\u27e8}{\mu}^{t}\mathrm{,}A\mathit{}x\mathrm{}b\mathrm{\u27e9}\mathrm{+}\frac{\beta}{\mathrm{2}}\mathit{}{\mathrm{\parallel}A\mathit{}x\mathrm{}b\mathrm{\parallel}}^{\mathrm{2}}\mathrm{+}\frac{\beta}{\mathrm{2}}\mathit{}{\mathrm{\parallel}x\mathrm{}{x}^{t}\mathrm{\parallel}}_{{B}^{T}\mathit{}B}$. Suppose assumptions are satisfied, then the following is true for the algorithm
$$\begin{array}{cc}& {L}_{\beta}({x}^{t+1},{\mu}^{t+1}){L}_{\beta}({x}^{t},{\mu}^{t})\hfill \\ \hfill \le & \frac{\beta}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+\frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}{\parallel {x}^{t}{x}^{t1}\parallel}^{2}+\frac{3}{\beta}{\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}+\frac{3\beta}{{\sigma}_{\mathrm{min}}}{\parallel {B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right)\parallel}^{2}\hfill \\ & +\u27e8{\u03f5}_{t},{x}^{t+1}{x}^{t}\u27e9\hfill \end{array}$$  (23) 
Proof.
By the Assumptions ${A}^{T}A+{B}^{T}B\ge I$, the objective function in (15) is strongly convex with parameter $\beta $.
Using the optimality condition of ${x}^{t+1}$ and strong convexity, we have for any $x$,
$$\begin{array}{cc}& {L}_{\beta}(x,{\mu}^{t})+\frac{\beta}{2}{\parallel x{x}^{t}\parallel}_{{B}^{T}B}^{2}({L}_{\beta}({x}^{t+1},{\mu}^{t})+\frac{\beta}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}_{{B}^{T}B}^{2})\hfill \\ & \ge \u27e8\nabla {L}_{\beta}({x}^{t+1},{\mu}^{t})+\beta {B}^{T}B({x}^{t+1}{x}^{t}),x{x}^{t+1}\u27e9+\frac{\beta}{2}{\parallel {x}^{t+1}x\parallel}^{2}\hfill \end{array}$$  (24) 
Now we start to provide a upper bound of ${L}_{\beta}({x}^{t+1},{\mu}^{t+1}){L}_{\beta}({x}^{t},{\mu}^{t})$ .
$$\begin{array}{cc}& {L}_{\beta}({x}^{t+1},{\mu}^{t+1}){L}_{\beta}({x}^{t},{\mu}^{t})\hfill \\ \hfill =& {L}_{\beta}({x}^{t+1},{\mu}^{t+1}){L}_{\beta}({x}^{t+1},{\mu}^{t})+{L}_{\beta}({x}^{t+1},{\mu}^{t}){L}_{\beta}({x}^{t},{\mu}^{t})\hfill \\ \hfill \le & {L}_{\beta}({x}^{t+1},{\mu}^{t+1}){L}_{\beta}({x}^{t+1},{\mu}^{t})+{L}_{\beta}({x}^{t+1},{\mu}^{t})+\frac{\beta}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}_{{B}^{T}B}^{2}{L}_{\beta}({x}^{t},{\mu}^{t})\hfill \\ \hfill \stackrel{\mathit{a}}{\le}& \frac{\parallel {\mu}^{t+1}{\mu}^{t}\parallel}{\beta}+\u27e8\nabla {L}_{\beta}({x}^{t+1},{\mu}^{t})+\beta {B}^{T}B({x}^{t+1}{x}^{t}),{x}^{t+1}{x}^{t}\u27e9\frac{\beta}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}\hfill \\ \hfill \stackrel{\mathit{b}}{\le}& \frac{\beta}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+\frac{{\parallel {\mu}^{t+1}{\mu}^{t}\parallel}^{2}}{\beta}+\u27e8{\u03f5}_{t},{x}^{t+1}{x}^{t}\u27e9\hfill \\ \hfill \stackrel{\mathit{c}}{\le}& \frac{\beta}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+\frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}{\parallel {x}^{t}{x}^{t1}\parallel}^{2}+\frac{3}{\beta}{\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}\hfill \\ & +\frac{3\beta}{{\sigma}_{\mathrm{min}}}{\parallel {B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right)\parallel}^{2}+\u27e8{\u03f5}_{t},{x}^{t+1}{x}^{t}\u27e9,\hfill \end{array}$$  (25) 
where the inequality (a) holds from the update rule in (16) and a simple algebra from the expression of ${L}_{\beta}(x,\mu )$. Inequality (b) comes from the optimality condition of (15). Particularly, we have
$$g({x}^{t})+{A}^{T}{\mu}^{t}+\beta {A}^{T}(Axb)+\beta {B}^{T}B(x{x}^{t})=0$$ 
replace $g({x}^{t})$ by $\nabla f({x}^{t})+{\u03f5}_{t}$, we have the result. The inequality (c) holds using the Lemma 1.
∎
Lemma 3.
Suppose Assumption 2 is satisfied, then the following condition holds.
$$\begin{array}{cc}& \frac{\beta}{2}({\parallel A{x}^{t+1}b\parallel}^{2}+{\parallel {x}^{t+1}{x}^{t}\parallel}_{{B}^{T}B}^{2})\hfill \\ \hfill \le & \frac{L}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+\frac{L}{2}{\parallel {x}^{t}{x}^{t1}\parallel}^{2}+\frac{\beta}{2}({\parallel {x}^{t}{x}^{t1}\parallel}_{{B}^{T}B}^{2}+{\parallel A{x}^{t}b\parallel}^{2})\hfill \\ & \frac{\beta}{2}({\parallel ({x}^{t}{x}^{t1})({x}^{t+1}{x}^{t})\parallel}_{{B}^{T}B}^{2}+{\parallel A({x}^{t+1}{x}^{t})\parallel}^{2})\u27e8{\u03f5}_{t}{\u03f5}_{t1},{x}^{t+1}{x}^{t}\u27e9\hfill \end{array}$$  (26) 
Proof.
Using the optimality condition of ${x}^{t+1}$ and ${x}^{t}$ in the update rule in (15), we obtain
$$\u27e8g({x}^{t})+{A}^{T}{\mu}^{t}+\beta {A}^{T}(A{x}^{t+1}b)+\beta {B}^{T}B({x}^{t+1}{x}^{t}),{x}^{t+1}x\u27e9\le 0,\forall x$$  (27) 
and
$$\u27e8g({x}^{t1})+{A}^{T}{\mu}^{t1}+\beta {A}^{T}(A{x}^{t}b)+\beta {B}^{T}B({x}^{t}{x}^{t1}),{x}^{t}x\u27e9\le 0,\forall x$$  (28) 
Replacing $g({x}^{t})$ by $\nabla f({x}^{t})+{\u03f5}_{t}$ and $g({x}^{t1})$ by $\nabla f({x}^{t1})+{\u03f5}_{t1}$, and using the update rule (16)
$$\u27e8\nabla f({x}^{t})+{\u03f5}_{t}+{A}^{T}{\mu}^{t+1}+\beta {B}^{T}B({x}^{t+1}{x}^{t}),{x}^{t+1}x\u27e9\le 0,\forall x$$  (29) 
$$\u27e8\nabla f({x}^{t1})+{\u03f5}_{t1}+{A}^{T}{\mu}^{t}+\beta {B}^{T}B({x}^{t}{x}^{t1}),{x}^{t}x\u27e9\le 0,\forall x$$  (30) 
Now choose $x={x}^{t}$ in the first inequality and $x={x}^{t+1}$ in the second one, adding two inequalities together, we obtain
$$\u27e8\nabla f({x}^{t})\nabla f({x}^{t1})+{\u03f5}_{t}{\u03f5}_{t1}+{A}^{T}({\mu}^{t+1}{\mu}^{t})+\beta {B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right),{x}^{t+1}{x}^{t}\u27e9\le 0$$  (31) 
Rearranging above terms, we have
$$\begin{array}{cc}& \u27e8{A}^{T}({\mu}^{t+1}{\mu}^{t}),{x}^{t+1}{x}^{t}\u27e9\hfill \\ & \le \u27e8\nabla f({x}^{t})\nabla f({x}^{t1})+{\u03f5}_{t}{\u03f5}_{t1}+\beta {B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right),{x}^{t+1}{x}^{t}\u27e9\hfill \end{array}$$  (32) 
We first reexpress the lhs of above inequality.
$$\begin{array}{cc}& \u27e8{A}^{T}({\mu}^{t+1}{\mu}^{t}),{x}^{t+1}{x}^{t}\u27e9\hfill \\ \hfill =& \u27e8\beta {A}^{T}(A{x}^{t+1}b),{x}^{t+1}{x}^{t}\u27e9\hfill \\ \hfill =& \u27e8\beta (A{x}^{t+1}b),A{x}^{t+1}b(A{x}^{t}b)\u27e9\hfill \\ \hfill =& \beta {\parallel A{x}^{t+1}b\parallel}^{2}\beta \u27e8A{x}^{t+1}b,A{x}^{t}b\u27e9\hfill \\ \hfill =& \frac{\beta}{2}({\parallel A{x}^{t+1}b\parallel}^{2}{\parallel A{x}^{t}b\parallel}^{2}+{\parallel A({x}^{t+1}{x}^{t})\parallel}^{2})\hfill \end{array}$$  (33) 
Next, we bound the rhs of (32).
$$\begin{array}{cc}& \u27e8\nabla f({x}^{t})\nabla f({x}^{t1})+{\u03f5}_{t}{\u03f5}_{t1}+\beta {B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right),{x}^{t+1}{x}^{t}\u27e9\hfill \\ \hfill =& \u27e8\nabla f({x}^{t})\nabla f({x}^{t1})+{\u03f5}_{t}{\u03f5}_{t1},{x}^{t+1}{x}^{t}\u27e9\beta \u27e8{B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right),{x}^{t+1}{x}^{t}\u27e9\hfill \\ \hfill \stackrel{\mathit{a}}{\le}& \frac{L}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+\frac{1}{2L}{\parallel \nabla f({x}^{t})\nabla f({x}^{t1})\parallel}^{2}\u27e8{\u03f5}_{t}{\u03f5}_{t1},{x}^{t+1}{x}^{t}\u27e9\hfill \\ & \beta \u27e8{B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right),{x}^{t+1}{x}^{t}\u27e9\hfill \\ \hfill \stackrel{\mathit{b}}{\le}& \frac{L}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+\frac{L}{2}{\parallel {x}^{t}{x}^{t1}\parallel}^{2}\u27e8{\u03f5}_{t}{\u03f5}_{t1},{x}^{t+1}{x}^{t}\u27e9\hfill \\ & \beta \u27e8{B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right),{x}^{t+1}{x}^{t}\u27e9\hfill \\ \hfill =& \frac{L}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+\frac{L}{2}{\parallel {x}^{t}{x}^{t1}\parallel}^{2}\u27e8{\u03f5}_{t}{\u03f5}_{t1},{x}^{t+1}{x}^{t}\u27e9\hfill \\ \hfill +& \frac{\beta}{2}\left({\parallel {x}^{t}{x}^{t1}\parallel}_{{B}^{T}B}^{2}{\parallel {x}^{t+1}{x}^{t}\parallel}_{{B}^{T}B}^{2}{\parallel ({x}^{t}{x}^{t1})({x}^{t+1}{x}^{t})\parallel}_{{B}^{T}B}^{2}\right),\hfill \end{array}$$  (34) 
where the inequality (a) uses CauchySchwartz inequality, (b) holds from the smoothness assumption on $f$.
Combine all pieces together, we obtain
$$\begin{array}{cc}& \frac{\beta}{2}({\parallel A{x}^{t+1}b\parallel}^{2}+{\parallel {x}^{t+1}{x}^{t}\parallel}_{{B}^{T}B}^{2})\hfill \\ \hfill \le & \frac{L}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+\frac{L}{2}{\parallel {x}^{t}{x}^{t1}\parallel}^{2}+\frac{\beta}{2}({\parallel {x}^{t}{x}^{t1}\parallel}_{{B}^{T}B}^{2}+{\parallel A{x}^{t}b\parallel}^{2})\hfill \\ & \frac{\beta}{2}({\parallel ({x}^{t}{x}^{t1})({x}^{t+1}{x}^{t})\parallel}_{{B}^{T}B}^{2}+{\parallel A({x}^{t+1}{x}^{t})\parallel}^{2})\u27e8{\u03f5}_{t}{\u03f5}_{t1},{x}^{t+1}{x}^{t}\u27e9\hfill \end{array}$$  (35) 
∎
Same with Hong et al. [2017], we define the potential function
$${P}_{c,\beta}({x}^{t+1},{x}^{t},{\mu}^{t+1})={L}_{\beta}({x}^{t+1},{\mu}^{t+1})+\frac{c\beta}{2}({\parallel A{x}^{t+1}b\parallel}^{2}+{\parallel {x}^{t+1}{x}^{t}\parallel}_{{B}^{T}B}^{2})$$  (36) 
Lemma 4.
If Assumption 2 holds, we have following
$$\begin{array}{cc}& {P}_{c,\beta}({x}^{t+1},{x}^{t},{\mu}^{t+1})\hfill \\ \hfill \le & {P}_{c,\beta}({x}^{t},{x}^{t1},{\mu}^{t})(\frac{\beta}{2}\frac{cL}{2}\frac{2c+1}{2}){\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+(\frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}+\frac{cL}{2}){\parallel {x}^{t1}{x}^{t}\parallel}^{2}\hfill \\ & (\frac{c\beta}{2}\frac{3\beta \parallel {B}^{T}B\parallel}{{\sigma}_{\mathrm{min}}}){\parallel ({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\parallel}_{{B}^{T}B}^{2}+(\frac{c}{4}+\frac{\beta}{3}){\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}+\frac{1}{2}{\parallel {\u03f5}_{t}\parallel}^{2}\hfill \end{array}$$  (37) 
Proof.
$$\begin{array}{cc}& {P}_{c,\beta}({x}^{t+1},{x}^{t},{\mu}^{t+1})\hfill \\ \hfill \le & {L}_{\beta}({x}^{t},{\mu}^{t})+\frac{c\beta}{2}({\parallel {x}^{t}{x}^{t1}\parallel}_{{B}^{T}B}+{\parallel A{x}^{t}b\parallel}^{2})(\frac{\beta}{2}\frac{cL}{2}){\parallel {x}^{t+1}{x}^{t}\parallel}^{2}\hfill \\ \hfill +& (\frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}+\frac{cL}{2}){\parallel {x}^{t1}{x}^{t}\parallel}^{2}(\frac{c\beta}{2}\frac{3\beta \parallel {B}^{T}B\parallel}{{\sigma}_{\mathrm{min}}}){\parallel ({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\parallel}_{{B}^{T}B}^{2}\hfill \\ \hfill +& \frac{3}{\beta}{\parallel {\u03f5}_{t}{\u03f5}_{t1}\parallel}^{2}+\u27e8{\u03f5}_{t},{x}^{t+1}{x}^{t}\u27e9c\u27e8{\u03f5}_{t}{\u03f5}_{t1},{x}^{t+1}{x}^{t}\u27e9\hfill \\ \hfill \le & {P}_{c,\beta}({x}^{t},{x}^{t1},{\mu}^{t})(\frac{\beta}{2}\frac{cL}{2}){\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+(\frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}+\frac{cL}{2}){\parallel {x}^{t1}{x}^{t}\parallel}^{2}\hfill \\ & (\frac{c\beta}{2}\frac{3\beta \parallel {B}^{T}B\parallel}{{\sigma}_{\mathrm{min}}}){\parallel ({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\parallel}_{{B}^{T}B}^{2}+\frac{c}{4}{\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}\hfill \\ & +c{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+\frac{1}{2}{\parallel {\u03f5}_{t}\parallel}^{2}+\frac{1}{2}{\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+\frac{\beta}{3}{\parallel {\u03f5}_{t}{\u03f5}_{t1}\parallel}^{2}\hfill \\ \hfill =& {P}_{c\beta}({x}^{t},{x}^{t1},{\mu}^{t})(\frac{\beta}{2}\frac{cL}{2}\frac{2c+1}{2}){\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+(\frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}+\frac{cL}{2}){\parallel {x}^{t1}{x}^{t}\parallel}^{2}\hfill \\ & (\frac{c\beta}{2}\frac{3\beta \parallel {B}^{T}B\parallel}{{\sigma}_{\mathrm{min}}}){\parallel ({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\parallel}_{{B}^{T}B}^{2}+(\frac{c}{4}+\frac{\beta}{3}){\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}+\frac{1}{2}{\parallel {\u03f5}_{t}\parallel}^{2}\hfill \end{array}$$  (38) 
where the second inequality holds from the CauchySchwartz inequality.
We require that
$$\frac{c\beta}{2}\frac{3\beta \parallel {B}^{T}B\parallel}{{\sigma}_{\mathrm{min}}}\ge 0,$$ 
which is satisfied when
$$c\ge \frac{6\parallel {B}^{T}B\parallel}{{\sigma}_{\mathrm{min}}}$$  (39) 
We further require
$$(\frac{\beta}{2}\frac{cL}{2}\frac{2c+1}{2})\ge (\frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}+\frac{cL}{2}),$$ 
which will be used later in the telescoping.
Thus we require
$$\beta \ge 2cL+2c+1+\frac{6{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}.$$  (40) 
and choose $\beta \ge CL+\frac{2c+1}{2}+\frac{1}{2}\sqrt{{(2cL+2c+1)}^{2}+\frac{24{L}^{2}}{{\sigma}_{\mathrm{min}}}}$ ∎
Now we do summation over both side of (37) and have
$$\begin{array}{cc}& \sum _{t=1}^{T}[(\frac{\beta}{2}\frac{cL}{2}\frac{2c+1}{2}){\parallel {x}^{t+1}{x}^{t}\parallel}^{2}(\frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}+\frac{cL}{2}){\parallel {x}^{t1}{x}^{t}\parallel}^{2}]\hfill \\ & +\sum _{t=1}^{T}(\frac{c\beta}{2}\frac{3\beta \parallel {B}^{T}B\parallel}{{\sigma}_{\mathrm{min}}}){\parallel ({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\parallel}_{{B}^{T}B}^{2}\hfill \\ \hfill \le & {P}_{c\beta}({x}^{1},{x}^{0},{\mu}^{0}){P}_{c\beta}({x}^{T+1},{x}^{T},{\mu}^{T})+\sum _{t=1}^{T}[(\frac{c}{4}+\frac{\beta}{3}){\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}+\frac{1}{2}{\parallel {\u03f5}_{t}\parallel}^{2}]\hfill \end{array}$$  (41) 
rearrange terms of above inequality.
$$\begin{array}{cc}& \sum _{t=1}^{T1}(\frac{\beta}{2}\frac{cl}{2}\frac{2c+1}{2}\frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}\frac{cl}{2}){\parallel {x}^{t+1}{x}^{t}\parallel}^{2}+(\frac{\beta}{2}\frac{cl}{2}\frac{2c+1}{2}){\parallel {x}^{T+1}{x}^{T}\parallel}^{2}\hfill \\ \hfill \le & {P}_{c\beta}({x}^{1},{x}^{0},{\mu}^{0}){P}_{c\beta}({x}^{T+1},{x}^{T},{\mu}^{T})+(\frac{3{L}^{2}}{\beta {\sigma}_{\mathrm{min}}}+\frac{cl}{2}){\parallel {x}^{1}{x}^{0}\parallel}^{2}+\sum _{t=1}^{T}[(\frac{c}{4}+\frac{\beta}{3}){\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}+\frac{1}{2}{\parallel {\u03f5}_{t}\parallel}^{2}]\hfill \end{array}$$  (42) 
Next we show ${P}_{c\beta}$ is lower bounded
The following lemma is from Lemma 3.5 in Hong [2016], we present here for completeness.
Lemma 5.
Suppose Assumption 2 are satisfied, and $\mathrm{(}c\mathrm{,}\beta \mathrm{)}$ are chosen according to (40) and (39). Then the following state holds true
$\exists \underset{\xaf}{P}s.t.,{P}_{c\beta}({x}^{t+1},{x}^{t},{\mu}^{t+1})\ge \underset{\xaf}{P}>\mathrm{\infty}$
Proof.
$$\begin{array}{cc}\hfill {L}_{\beta}({x}^{t+1},{\mu}^{t+1})=& f({x}^{t+1})+\u27e8{\mu}^{t+1},A{x}^{t+1}b\u27e9+\frac{\beta}{2}{\parallel A{x}^{t+1}b\parallel}^{2}\hfill \\ \hfill =& f({x}^{t+1})+\frac{1}{\beta}\u27e8{\mu}^{t+1},{\mu}^{t+1}{\mu}^{t}\u27e9+\frac{\beta}{2}{\parallel A{x}^{t+1}b\parallel}^{2}\hfill \\ \hfill =& f({x}^{t+1})+\frac{1}{2\beta}({\parallel {\mu}^{t+1}\parallel}^{2}{\parallel {\mu}^{t}\parallel}^{2}+{\parallel {\mu}^{t+1}{\mu}^{t}\parallel}^{2})+\frac{\beta}{2}{\parallel A{x}^{t+1}b\parallel}^{2}.\hfill \end{array}$$  (43) 
Sum over both side, we obtain
$$\sum _{t=1}^{T}{L}_{\beta}({x}^{t+1},{\mu}^{t+1})=\sum _{t=1}^{T}\left(f({x}^{t+1})+\frac{\beta}{2}{\parallel A{x}^{t+1}b\parallel}^{2}+\frac{1}{2\beta}{\parallel {\mu}^{t+1}{\mu}^{t}\parallel}^{2}\right)+\frac{1}{2\beta}({\parallel {\mu}^{T+1}\parallel}^{2}{\parallel {\mu}^{1}\parallel}^{2})$$  (44) 
By assumption 2, above sum is lower bounded, which implies that the sum of the potential function is also lower bounded (Recall ${P}_{c,\beta}({x}^{t+1},{x}^{t},{\mu}^{t+1})={L}_{\beta}({x}^{t+1},{\mu}^{t+1})+\frac{c\beta}{2}({\parallel A{x}^{t+1}b\parallel}^{2}+{\parallel {x}^{t+1}{x}^{t}\parallel}_{{B}^{T}B}^{2})$ ). Thus we have
$${P}_{c\beta}({x}^{t+1},{x}^{t},{\mu}^{t+1})>\mathrm{\infty},\forall t>0$$ 
∎
In the next step, we are ready to provide the convergence rate. Following Hong [2016], we define the convergence criteria
$$Q({x}^{t+1},{\mu}^{t+1})={\parallel \nabla {L}_{\beta}({x}^{t+1},{\mu}^{t})\parallel}^{2}+{\parallel A{x}^{t+1}b\parallel}^{2}$$  (45) 
It is easy to see, when $Q({x}^{t+1},{\mu}^{t})=0$, $\nabla f(x)+{A}^{T}\mu =0$ and $Ax=b$, which are KKT condition of the problem.
$$\begin{array}{cc}& {\parallel \nabla {L}_{\beta}({x}^{t},{\mu}^{t1})\parallel}^{2}\hfill \\ \hfill =& {\parallel \nabla f({x}^{t})\nabla f({x}^{t1})+{\u03f5}_{t}{\u03f5}_{t1}+{A}^{T}({\mu}^{t+1}{\mu}^{t})+\beta {B}^{T}B({x}^{t+1}{x}^{t})\parallel}^{2}\hfill \\ \hfill \le & 4{L}^{2}{\parallel {x}^{t}{x}^{t1}\parallel}^{2}+4{\parallel {\mu}^{t+1}{\mu}^{t}\parallel}^{2}\parallel {A}^{T}A\parallel +4{\beta}^{2}{\parallel {B}^{T}B({x}^{t+1}{x}^{t})\parallel}^{2}+4{\parallel {\u03f5}_{t}{\u03f5}_{t1}\parallel}^{2}\hfill \end{array}$$  (46) 
Using the proof in Lemma 1, we know there exist two positive constants c1 c2 c3 c4
$$Q({x}^{t},{\mu}^{t1})\le {c}_{1}{\parallel {x}^{t}{x}^{t+1}\parallel}^{2}+{c}_{2}{\parallel {x}^{t}{x}^{t1}\parallel}^{2}+{c}_{3}{\parallel {B}^{T}B\left(({x}^{t+1}{x}^{t})({x}^{t}{x}^{t1})\right)\parallel}^{2}+{c}_{4}{\parallel {\u03f5}_{t}{\u03f5}_{t1}\parallel}^{2}.$$ 
Using Lemma 4, we know there must exist a constant $\kappa $ such that
$$\begin{array}{cc}& \sum _{t=1}^{T1}Q({x}^{t},{\mu}^{t1})\hfill \\ \hfill \le & \kappa ({P}_{c\beta}({x}^{1},{x}^{0},{\mu}^{0}){P}_{c\beta}({x}^{T+1},{x}^{T},{\mu}^{T})+\sum _{t=1}^{T}[(\frac{c}{4}+\frac{\beta}{3}){\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}+\frac{1}{2}{\parallel {\u03f5}_{t}\parallel}^{2}])+{c}_{4}\sum _{t=1}^{T1}{\parallel {\u03f5}_{t}{\u03f5}_{t1}\parallel}^{2}\hfill \\ \hfill \le & \kappa ({P}_{c\beta}({x}^{1},{x}^{0},{\mu}^{0})\underset{\xaf}{P}+\sum _{t=1}^{T}[(\frac{c}{4}+\frac{\beta}{3}){\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}+\frac{1}{2}{\parallel {\u03f5}_{t}\parallel}^{2}])+{c}_{4}\sum _{t=1}^{T1}{\parallel {\u03f5}_{t}{\u03f5}_{t1}\parallel}^{2}\hfill \end{array}$$  (47) 
Divide both side by $T$ and take expectation
$$\begin{array}{cc}\hfill \frac{1}{T}\mathbb{E}\sum _{t=1}^{T}Q({x}^{t},{\mu}^{t1})\le \frac{1}{T}\kappa ({P}_{c\beta}({x}_{1},{x}^{0},{\mu}^{0})\underset{\xaf}{P})+& \frac{\kappa}{T}[\sum _{t=1}^{T}(\frac{c}{4}+\frac{\beta}{3})\mathbb{E}{\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}+\frac{1}{2}{\parallel {\u03f5}_{t}\parallel}^{2}]\hfill \\ \hfill +& \frac{{c}_{4}}{T}\sum _{t=1}^{T1}\mathbb{E}{\parallel {\u03f5}_{t}{\u03f5}_{t1}\parallel}^{2}\hfill \end{array}$$  (48) 
Now we bound the R.H.S. of above equation.
Recall we choose the minibatch size $\sqrt{T}$, ${\u03f5}_{t}={\epsilon}_{t}+{\stackrel{~}{\epsilon}}_{t}$ and ${\epsilon}_{t}\le {c}_{1}/\sqrt{T}$
$${\parallel {\u03f5}_{t1}{\u03f5}_{t}\parallel}^{2}\le 2\mathbb{E}({\parallel {\u03f5}_{t1}\parallel}^{2}+{\parallel {\u03f5}_{t}\parallel}^{2})\le 4\mathbb{E}({\parallel {\epsilon}_{t}\parallel}^{2}+{\parallel {\stackrel{~}{\epsilon}}_{t}\parallel}^{2}+{\parallel {\epsilon}_{t1}\parallel}^{2}+{\parallel {\stackrel{~}{\epsilon}}_{t1}\parallel}^{2})\le \frac{8{c}_{1}}{T}+\frac{8{\sigma}^{2}}{T}$$  (49) 
Similarly we can bound ${\parallel {\u03f5}_{t}\parallel}^{2}$. Combine all pieces together, we obtain
$$\frac{1}{T}\mathbb{E}\sum _{t=1}^{T}Q({x}^{t},{\mu}^{t1})\le \frac{1}{T}\kappa ({P}_{c\beta}({x}_{1},{x}^{0},{\mu}^{0})\underset{\xaf}{P})+\frac{1}{T}(\kappa {c}_{5}+{c}_{6}{\sigma}^{2}),$$ 
where ${c}_{5},{c}_{6}$ are some universal positive constants.
Notice ${\mathrm{min}}_{t}\mathbb{E}Q({x}^{t},{\mu}^{t1})\le \frac{1}{T}\mathbb{E}{\sum}_{t=1}^{T}Q({x}^{t},{\mu}^{t1})$, we have ${\mathrm{min}}_{t}\mathbb{E}Q({x}^{t},{\mu}^{t1})\le (C+{\sigma}^{2})/T$ where $C$ is a universal positive constant.
B.2 Convergence on the dual update
If the dual objective function is nonconvex, we just follow the exact analysis in our proof on the primal problem. Notice the analysis on the dual update is easier than primal one, since we do not have the error term ${\stackrel{~}{\u03f5}}_{t}$. Therefore, we have the algorithm converges to stationary solution with rate $\mathcal{O}(1/T)$in criteria $Q$.