### Abstract

The recent success of single-agent reinforcement learning (RL) encourages theexploration of multi-agent reinforcement learning (MARL), which is morechallenging due to the interactions among different agents. In this paper, weconsider a voting-based MARL problem, in which the agents vote to make groupdecisions and the goal is to maximize the globally averaged returns. To thisend, we formulate the MARL problem based on the linear programming form of thepolicy optimization problem and propose a distributed primal-dual algorithm toobtain the optimal solution. We also propose a voting mechanism through whichthe distributed learning achieves the same sub-linear convergence rate ascentralized learning. In other words, the distributed decision making does notslow down the global consensus to optimal. We also verify the convergence ofour proposed algorithm with numerical simulations and conduct case studies inpractical multi-agent systems.

### Quick Read (beta)

# Voting-Based Multi-Agent Reinforcement Learning

###### Abstract

The recent success of single-agent reinforcement learning (RL) encourages the exploration of multi-agent reinforcement learning (MARL), which is more challenging due to the interactions among different agents. In this paper, we consider a voting-based MARL problem, in which the agents vote to make group decisions and the goal is to maximize the globally averaged returns. To this end, we formulate the MARL problem based on the linear programming form of the policy optimization problem and propose a distributed primal-dual algorithm to obtain the optimal solution. We also propose a voting mechanism through which the distributed learning achieves the same sub-linear convergence rate as centralized learning. In other words, the distributed decision making does not slow down the global consensus to optimal. We also verify the convergence of our proposed algorithm with numerical simulations and conduct case studies in practical multi-agent systems.

## 1 Introduction

Reinforcement learning (RL) aims at maximizing a cumulative reward by selecting a sequence of optimal actions to interact with a stochastic unknown environment [36], where the dynamics is usually modeled as a Markov decision process (MDP). The recent success of single-agent RL in various fields, e.g., video games [34, 35], data analysis [27], and system control [41, 40], encourages the extension to multi-agent reinforcement learning (MARL), which, however, is more challenging since each agent interacts with not only the environment but also the other agents. In this paper, we focus on the collaborative MARL setting where the aim is to maximize the globally averaged return of all agents in the environment. In addition, we assume that each agent can only observe its own reward, which may differ across different agents.

For collaborative MARL, it is critical to specify a proper collaboration protocol so as to promote efficient cooperations among the agents. One tempting choice is to allow mutual communication among neighboring agents for coordination [43, 38, 26, 20, 28]. However, such communication requires the agents to be connected for information exchange. In contrast, in this paper, we analyze a different approach consisting of a voting mechanism, where the agents vote to determine the joint action with no mutual communication, such that it is topology independent. Such a voting-based architecture finds wide applications in many practical multi-agent systems, e.g., vehicle networks [3], sensor networks [30, 17], and social networks [16].

Our primary interest is to develop a sample-efficient model-free distributed MARL algorithm built upon voting-based coordinations in the presence of a generative model of the MDP [10, 37, 4]. In particular, we consider the underlying MDP unknown but having access to a sampling oracle which takes an arbitrary state-action pair $(i,a)$ as input and generates the next state $j$ with probability ${p}_{ij}(a)$, along with an immediate reward for each individual agent. Such a simulator-defined MDP has been studied in existing literatures with a single RL agent, including the model-based RL [10, 37, 4] and model-free RL [18, 19]. Our problem formulation is based on the saddle point formulation of policy optimization in MDPs [33]. In this context, we propose a distributed MARL algorithm to estimate the policy and value function in MARL through primal-dual iterations. Each pair of the local primal and dual variables correspond to the local value and voting policy of each agent, respectively. In this way, obtaining the optimal voting policy of each agent is equivalent to obtaining the optimal value of its dual variable. We then propose a voting mechanism to specify how local votes determine the global action to take, by which optimizing the local voting policy of each agent distributively is equivalent to optimizing the global acting policy of all the agents. The analysis and simulations in this paper will yield insights behind voting-related behaviors and motivate us to understand how voting mechanism works in a distributed, collaborative and interactive system.

#### Related Work.

Many existing works on model-free MARL is based on the framework of Markov games [32, 21, 22, 42, 2] or temporal-difference RL [43, 12, 15, 23, 31, 13]. Specifically, the study of MARL in the context of Markov game mainly adapts a stochastic game into the MARL formulation, which applies to both collaborative and competitive settings. The representatives include the cooperative games [32], zero-sum stochastic games [21], general-sum stochastic games [22], decentralized Q-Learning [2], and the recent mean-field MARL [42]. Alternatively, the study of MARL in the context of temporal-difference RL mainly origins from dynamic programming, which learns by following the Bellman’s equation. For example, [12, 15, 23, 31, 13] studied deep MARL that uses deep neural networks as function approximators; more recently, [43] studied the convergence of the actor-critic algorithm with linear function approximators in a MARL system consist of networked agents. However, the above MARL approaches either focus on empirical performance without theoretical guarantees [12, 15, 23, 31, 13] or only provide asymptotic convergence without providing an explicit sample complexity analysis towards an $\u03f5$-optimal solution [43].

On the other hand, there are two research lines in existing literature that focus on the saddle point formulation of RL. One research line studies the saddle point formulation resulted from the fixed-point problem of policy evaluation [26, 9, 11, 38, 20], i.e., learning the value function of a fixed policy. Among others, [20, 38] provided the sample complexity analysis of policy evaluation in the context of MARL, where the policies of all agents are fixed. In contrast, the other research line, including this paper, focuses on the saddle point formulation resulted from the policy optimization problem [8, 39, 7], where the policy is continuously updated towards the optimal one, making the analysis substantially more challenging than that for the policy evaluation. In the single-agent setting, our work is closely related to [39]. However, to the best of our knowledge, our work is the first to consider solving a saddle-point policy optimization in the context of MARL, which considers the coordination among multiple agents. Moreover, we also provide numerical simulations and case studies for verification, while previous works mainly focus on theoretical analysis [8, 39, 7]. Finally, compared with the widely considered communication-based coordination in MARL [43, 26, 38, 20], our proposed voting-based coordination is more promising to be applied in many voting-based applications [3, 30, 17, 16]. Moreover, it is also interesting to study the voting mechanism and the related behaviors under competitive settings in the future work.

#### Main Contribution.

Our main contribution is three-fold: 1) we formulate the MARL problem based on the liner programming form of the policy optimization problem and propose a distributed primal-dual algorithm to obtain the optimal solution by exploiting the linear duality between the value and policy of the Bellman equation in the context of MARL; 2) we propose to coordinate the agents through voting, which is topology independent, and propose a voting mechanism through which the distributed learning achieves the same sub-linear convergence as centralized learning. In other words, the proposed distributed decision-making process does not slow down the global consensus to optimal; 3) our proposed algorithm and analysis covers the single-agent learning as a special case, which makes it more general. We also verify the convergence of our proposed algorithm through numerical simulations and demonstrate practical applications to justify the learning effectiveness.

## 2 Problem Formulation

### 2.1 Multi-Agent AMDP

We focus on the infinite-horizon average-reward Markov decision process (AMDP) as in [39, 7], which aims at optimizing the average-per-time-step rewards over an infinite decision sequence. Existing literatures usually model RL as a discounted MDP, which aims at maximizing the discounted cumulative reward with a discount factor $\gamma \in (0,1)$. However, the discounted MDP is indeed an approximation to the infinite-horizon undiscounted MDP [39]. Hence, in this paper, we do not assume that the future rewards are discounted, but focus on the AMDP under fast mixing and stationary properties, which are defined in Sec. 4. The multi-agent AMDP can be described as

$$\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},{\left\{{\mathcal{R}}_{m}\right\}}_{m=1}^{M}),$$ | (1) |

where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}=\{{P}_{a}(s,{s}^{\prime})|s,{s}^{\prime}\in \mathcal{S},a\in \mathcal{A}\}$ is the collection of state-to-state transition probabilities, and ${\left\{{\mathcal{R}}_{m}\right\}}_{m=1}^{M}$ is the collection of local reward functions with $M$ the number of agents. Moreover, we consider the reward functions of the agents may differ from each other and are private to each corresponding agent.

The considered MARL system selects the action to take according to the votes from local agents. Each agent determines its vote individually without communicating with the others. Specifically, at each time step $t$, the MARL system works as follows: 1) all agents observe the state ${s}_{t}\in \mathcal{S}$; 2) each agent votes for the action ${a}_{t}$ to take under ${s}_{t}$; 3) the system determines the action ${a}_{t}$ to take according to the votes; 4) the system executes ${a}_{t}$ and returns the immediate rewards ${\left\{{r}_{{a}_{t}}^{m}\right\}}_{m=1}^{M}$ to each corresponding agent; 5) the system shifts to a new state ${s}_{t+1}\in \mathcal{S}$ with the probability $P({s}_{t+1}|{s}_{t},{a}_{t})$ and starts the next iteration.

We denote the local voting policy of each agent as ${\pi}^{m}\in \mathrm{\Xi}\subset {\mathbb{R}}^{S\times A},m\in \mathcal{M}$, which is a randomized stationary policy with $\mathrm{\Xi}$ consisting of non-negative matrices whose $(s,a)$-th entry $\pi (s,a)$ denotes the probability of taking action $a$ at state $s$. The global acting policy to determine the joint action to take is denoted as ${\pi}^{g}\in \mathrm{\Xi}\subset {\mathbb{R}}^{S\times A}$. Indeed, the global acting policy is determined by the local voting policies jointly, which is specified by the voting mechanism discussed later.

### 2.2 Multi-Agent Policy Optimization

The multi-agent policy optimization aims at improving the global acting policy by maximizing the sum of local average-rewards, which can be written as

$$\underset{{\pi}^{g}}{\mathrm{max}}\{{\overline{v}}^{{\pi}^{g}}=\underset{T\to \mathrm{\infty}}{lim}{\mathbb{E}}^{{\pi}^{g}}\left[\frac{1}{T}\sum _{t=1}^{T}\sum _{m=1}^{M}{r}_{{i}_{t}{i}_{t+1}}^{m}({a}_{t})\right|{i}_{1}=i],i\in \mathcal{S}\},$$ | (2) |

where ${\mathbb{E}}^{{\pi}^{g}}[\cdot ]$ is the expectation over all the state-action trajectories generated by the MARL system when following the acting policy ${\pi}^{g}$. According to the theory of dynamic programming [33, 5], the value ${\overline{v}}^{*}$ is the optimal average reward to problem (2) if and only if it satisfies the following Bellman equation:

$$\begin{array}{c}\hfill {\overline{v}}^{*}+{v}^{*}(i)=\underset{a\in \mathcal{A}}{\mathrm{max}}\left\{\sum _{j\in \mathcal{S}}{p}_{ij}(a){v}^{*}(j)+\sum _{j\in \mathcal{S}}{p}_{ij}(a)\sum _{m\in \mathcal{M}}{r}_{ij}^{m}(a)\right\},\forall i\in \mathcal{S},\end{array}$$ | (3) |

where ${p}_{ij}(a)$ is the transition probability from state $i$ to state $j$ after taking the action $a$ and ${v}^{*}(i)$ is referred to as the difference-of-value function [39, 7]. Note that there exist infinite many solutions of ${v}^{*}(i)$, e.g., by adding constant shifts, which does not affect our analysis.

### 2.3 Saddle Point Formulation

The multi-agent policy optimization problem (2) and the Bellman equation (3) admit linear programming forms [39, 7], which are dual to each other and can be formulated as the following minimax problem:

$$\underset{\bm{v}\in \mathcal{V}}{\mathrm{min}}\underset{\bm{\mu}\in \mathcal{U}}{\mathrm{max}}\sum _{a\in \mathcal{A}}{({\bm{\mu}}_{a}^{g})}^{\top}\left(({P}_{a}-I){\bm{v}}^{g}+\sum _{m\in \mathcal{M}}{\mathbf{r}}_{a}^{m}\right),$$ | (4) |

where ${\bm{v}}^{g}$ and ${\bm{\mu}}^{g}$ are the global primal and dual variables, respectively, with $\mathcal{V}$ and $\mathcal{U}$ their search spaces to be specified later, ${P}_{a}\in {\mathcal{R}}^{|\mathcal{S}|\times |\mathcal{S}|}$ is the MDP transition matrix under action $a$ with its $(i,j)$-th entry denoted as ${p}_{ij}(a)$, and ${\bm{r}}_{a}^{m}\in {\mathcal{R}}^{|\mathcal{S}|}$ is the expected state-transition reward under action $a$ with ${r}_{i,a}^{m}={\sum}_{j\in \mathcal{S}}{p}_{ij}(a){r}_{ij}^{m}(a),\forall i\in \mathcal{S}$.

It is known that the basis of the optimal dual variable ${\bm{\mu}}^{*}\in {\mathcal{R}}^{|\mathcal{S}||\mathcal{A}|}$ corresponds to an optimal deterministic policy [33], which can be obtained as ${\pi}_{i,a}^{*}={\mu}_{i,a}^{*}/{\parallel {\bm{\mu}}^{*}\parallel}_{1}$. Therefore, obtaining the optimal acting policy ${\pi}^{g}$ is equivalent to obtaining its corresponding optimal dual variable ${\bm{\mu}}^{*}$, which is our focus in this paper.

## 3 Voting-Based Multi-Agent Reinforcement Learning

In this section, we propose a voting mechanism to specify how local votes determine the global action. Then, we prove that the voting mechanism enables the update on the global acting policy to be equivalent to the update on the distributed voting policies, such that problem (4) can be solved in a distributed manner, which leads to our proposed distributed primal-dual learning algorithm.

### 3.1 Voting Mechanism

We introduce one pair of local primal and dual variables for each local agent $m\in \mathcal{M}$, denoted as ${v}_{m}$ and ${\mu}_{m}$. The voting mechanism takes the form:

$${\mu}_{g}(a|i)\propto \mathrm{exp}\left\{\sum _{m=1}^{M}\mathrm{log}\left({\mu}_{m}(a|i)\right)\right\}.$$ | (5) |

The purpose of the above voting mechanism is to combine the local dual variables ${\mu}_{m},m\in \mathcal{M}$ into a global variable ${\mu}_{g}$, which will then determine the next sample, i.e., the state-action pair $({i}_{t},{a}_{t})$, to query from the sampling oracle for learning. Note that the global dual variable ${\mu}_{g}$ corresponds to the global acting policy, while the local dual variable ${\mu}_{m}$ corresponds to the local voting policy, such that the voting mechanism also indicates the relationship between the global acting policy and the local voting policies. In other words, the agents are actually voting for the next sample to query from the sampling oracle.

### 3.2 Distributed Primal-Dual Learning Algorithm

We now develop a primal-dual learning algorithm to solve problem (4) in a distributed manner based on a double-sampling strategy. Specifically, we first update the local dual variables based on uniform sampling and then updates the local primal variables based on probability sampling, with the probability specified by the dual variables. The detailed procedure is provided in Algorithm 2. Next, we present the local primal-dual update at each iteration $t$ and prove that such a local update is equivalent to the global update with a properly specified voting mechanism.

#### Local Dual Update.

The first state-action pair $({i}_{t},{a}_{t})$ to update the local dual variables is sampled with uniform probability ${p}_{i,a}^{\text{dual}}=\frac{1}{|\mathcal{S}||\mathcal{A}|}$. The MARL system then shifts to the next state ${j}_{t}$ conditioned on $({i}_{t},{a}_{t})$ and returns the local rewards ${\left\{{r}_{{i}_{t},{j}_{t}}^{m}(a)\right\}}_{m=1}^{M}$ to each corresponding agent. The local dual variable ${\bm{\mu}}^{m,t}$ of agent $m$ is updated as

$${\mu}_{i,a}^{m,t+1}={\mu}_{i,a}^{m,t}\mathrm{exp}\left\{{\mathrm{\Delta}}_{i,a}^{m,t+1}\right\},$$ | (6) |

where

$${\mathrm{\Delta}}_{i,a}^{m,t+1}=\{\begin{array}{cc}\beta \left(\frac{\frac{1}{\beta}\mathrm{log}{x}^{t}+{v}_{j}-{v}_{i}-S}{M}+{r}_{ij}^{m}(a)\right),\hfill & \text{if}i={i}_{t},a={a}_{t}\hfill \\ \mathrm{log}{x}^{t},\hfill & \text{otherwise}\hfill \end{array}$$ | (7) |

with $\beta $ the step-size and

$${x}^{t+1}=\frac{1}{{\sum}_{i\in \mathcal{S},a\in \mathcal{A}}\mathrm{exp}\left\{{\sum}_{m=1}^{M}\mathrm{log}\left({\mu}_{i,a}^{m,t+1}\right)\right\}}.$$ | (8) |

In fact, ${x}^{t}$ is the proportion between the locally recovered partial derivatives and the global true partial derivatives of the minimax objective in (4). It also defines the explicit form of the voting mechanism as in Lemma 1. However, note that with or without using ${x}_{t}$ in the algorithm does not affect the convergence since it does not influence the sampling in the next primal update step; we use it in the proof only for theoretical analysis purpose.

#### Local Primal Update.

The second state-action pair $({i}_{t},{a}_{t})$ to update the local primal variables is sampled with probability

$${p}_{{i}_{t},{a}_{t}}^{\text{primal}}=\frac{\mathrm{exp}\left\{{\sum}_{m=1}^{M}\mathrm{log}\left({\mu}_{{i}_{t},{a}_{t}}^{m,t+1}\right)\right\}}{{\sum}_{i\in \mathcal{S},a\in \mathcal{A}}\mathrm{exp}\left\{{\sum}_{m=1}^{M}\mathrm{log}\left({\mu}_{i,a}^{m,t+1}\right)\right\}}.$$ | (9) |

The system then shifts to the next state ${j}_{t}$ conditioned on $({i}_{t},{a}_{t})$, and returns the local rewards to each corresponding agent. The local primal variable ${\bm{v}}^{t}$ is updated as

${\bm{v}}^{t+\frac{1}{2}}$ | $={\bm{v}}^{t}+\alpha {\bm{d}}^{t+1},$ | (10) | ||

${\bm{v}}^{t+1}$ | $={\mathrm{\Pi}}_{\mathcal{V}}\left\{{\bm{v}}^{t+\frac{1}{2}}\right\},$ | (11) |

where $\alpha $ is the step-size and ${\mathrm{\Pi}}_{\mathcal{V}}\{\cdot \}$ denotes the projection to the search spaces $\mathcal{V}$, which is defined in Sec. 4. Note that the local primal update is identical across the agents. Hence, we use the same notation ${v}_{i}^{t}$ in the primal update for all the agents in the sequel.

#### Equivalent Global Update.

We next prove that the distributed primal-dual updates on the local voting policies are equivalent to the centralized primal-dual updates on the global acting policy, with the voting mechanism specified as follows.

###### Lemma 1 (Equivalent Global Update)

By specifying the voting mechanism as

${\mu}_{i,a}^{g,t+1}={x}^{t+1}\mathrm{exp}\left\{{\displaystyle \sum _{m=1}^{M}}\mathrm{log}\left({\mu}_{i,a}^{m,t+1}\right)\right\},$ |

where ${x}^{t\mathrm{+}\mathrm{1}}\mathrm{=}\mathrm{1}\mathrm{/}{\mathrm{\parallel}\mathrm{exp}\mathit{}\mathrm{\left\{}{\mathrm{\sum}}_{m\mathrm{=}\mathrm{1}}^{M}\mathrm{log}\mathit{}\mathrm{\left(}{\mu}_{i\mathrm{,}a}^{m\mathrm{,}t\mathrm{+}\mathrm{1}}\mathrm{\right)}\mathrm{\right\}}\mathrm{\parallel}}_{\mathrm{1}}$, the local primal-dual updates are equivalent to the following global primal-dual updates:

${\bm{\mu}}^{g,t+1}$ | $={x}^{t+1}{\bm{\mu}}^{g,t}\mathrm{exp}\left\{{\mathbf{\Delta}}^{g,t+1}\right\},$ | ||

${\bm{v}}^{t+1}$ | $={\mathrm{\Pi}}_{\mathcal{V}}\left\{{\bm{v}}^{t}+{\bm{d}}^{t+1}\right\},$ |

where

${\mathrm{\Delta}}_{i,a}^{g,t+1}$ | $=\beta \left({v}_{j}-{v}_{i}-S+{\displaystyle \sum _{m\in \mathcal{M}}}{r}_{ij}^{m}(a)\right),$ | ||

${\bm{d}}^{t+1}$ | $=\alpha \left({\bm{e}}_{i}-{\bm{e}}_{j}\right).$ |

Remark that the global primal-dual updates are conditionally unbiased partial derivatives of the minimax objective given in (4).

###### Lemma 2 (Unbiasedness)

By specifying the voting mechanism as in Lemma 1, the gradient of the dual variable ${\mathrm{\Delta}}_{i\mathrm{,}a}^{g\mathrm{,}t\mathrm{+}\mathrm{1}}$ is the conditionally partial derivative of the minimax objective with a constant bias, concretely,

$$\begin{array}{c}\hfill \mathbb{E}\left[{\mathrm{\Delta}}_{i,a}^{g,t+1}\mid {\mathcal{F}}_{t}\right]=\frac{\beta}{|\mathcal{S}||\mathcal{A}|}{\left(({P}_{a}-I){\bm{v}}^{t}+\sum _{m\in \mathcal{M}}{\bm{r}}_{a}^{m}-S\right)}^{i},\forall a\in \mathcal{A},\forall i\in \mathcal{S},\end{array}$$ |

The gradient of the primal variable ${\mathbf{d}}^{t\mathrm{+}\mathrm{1}}$ is the conditionally unbiased partial derivative of the minimax objective; concretely,

$$\mathbb{E}\left[{d}_{i}^{t+1}\mid {\mathcal{F}}_{t}\right]=\alpha \sum _{a\in \mathcal{A}}{\mu}_{i,a}^{g,t}(I-{P}_{a}),\forall i\in \mathcal{S}.$$ |

[!h] \[email protected]@algorithmic[1] \STATEInitialization: \STATEMARL tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},{\left\{{\mathcal{R}}_{m}\right\}}_{m=1}^{M})$, $T$, $\alpha =|\mathcal{S}|(4{t}_{\text{mix}}^{*}+M)\sqrt{\frac{\mathrm{log}(|\mathcal{S}||\mathcal{A}|)}{2|\mathcal{S}||\mathcal{A}|T}}$, $\beta =\frac{1}{4{t}_{\text{mix}}^{*}+M}\sqrt{\frac{\mathrm{log}(|\mathcal{S}||\mathcal{A}|)}{2|\mathcal{S}||\mathcal{A}|T}}$, $S=4{t}_{\text{mix}}^{*}+M$. \STATESet $\bm{v}=\mathrm{\U0001d7ce}\in {\mathbb{R}}^{|\mathcal{S}|}$ and ${\mu}_{i,a}^{m,0}=\frac{1}{|\mathcal{S}||\mathcal{A}|},\forall i\in \mathcal{S},\forall a\in \mathcal{A}$.. \STATEIteration: \FOR$t=1,2,\mathrm{\cdots},T$ \STATEThe system samples $({i}_{t},{a}_{t})$ with probability ${p}_{{i}_{t},{a}_{t}}^{\text{dual}}$. \STATEThe system shifts to next state ${j}_{t}$ conditioned on $({i}_{t},{a}_{t})$, and generates the local rewards ${\left\{{r}_{{a}_{t}}^{m}\right\}}_{m=1}^{M}$. \FOR $m=1,2,\mathrm{\cdots},M$ \STATEThe agent $m$ updates its local dual variable according to

$${\mu}_{i,a}^{m,t+1}=\{\begin{array}{cc}{\mu}_{i,a}^{m,t}\mathrm{exp}\left\{\beta \left(\frac{{v}_{j}-{v}_{i}-S}{M}+{r}_{ij}^{m}(a)\right)\right\},\hfill & \text{if}i={i}_{t},a={a}_{t}\hfill \\ {\mu}_{i,a}^{m,t}.\hfill & \text{otherwise}\hfill \end{array}$$ |

The system samples $({i}_{t},{a}_{t})$ with probability ${p}_{{i}_{t},{a}_{t}}^{\text{primal}}$. \STATEThe system shifts to next state ${j}_{t}$ conditioned on $({i}_{t},{a}_{t})$, and generates the local rewards ${\left\{{r}_{{a}_{t}}^{m}\right\}}_{m=1}^{M}$. \FOR $m=1,2,\mathrm{\cdots},M$ \STATEThe agent $m$ updates its local primal according to (10) and (11). \ENDFOR\STATE$t=t+1$. \ENDFOR\STATE${\widehat{\mu}}_{i}^{g}=\frac{1}{T}{\sum}_{t=1}^{T}\mathrm{exp}\left\{{\sum}_{m=1}^{M}\mathrm{log}\left({\mu}_{i}^{m,t+1}\right)\right\},\forall i\in \mathcal{S}$. \STATEReturn: \STATE${\widehat{\pi}}_{i,a}^{g}=\frac{{\widehat{\mu}}_{i,a}^{g}}{{\sum}_{a\in \mathcal{A}}{\widehat{\mu}}_{i,a}^{g}},\forall i\in \mathcal{S}$.

## 4 Theoretical Results

In this section, we establish the convergence and sample complexity analysis for Algorithm 2. We start by making the following assumptions over the stationary distribution and the mixing time over the considered multi-agent AMDP. Similar assumptions have also been used in [7, 39] for the case with a single-agent RL.

###### Assumption 1

The multi-agent AMDP is ergodic under any stationary acting policy ${\pi}^{g}$ and there exists $\tau \mathrm{>}\mathrm{1}$ such that

$$\frac{1}{\sqrt{\tau}|\mathcal{S}|}\mathrm{\U0001d7cf}\le {\bm{\nu}}^{{\pi}^{g}}\le \frac{\sqrt{\tau}}{|\mathcal{S}|}\mathrm{\U0001d7cf},$$ |

where ${\mathbf{\nu}}^{{\pi}^{g}}$ is the stationary distribution under a stationary policy ${\pi}^{g}$.

The above assumption requires the multi-agent AMDP to be ergodic (aperiodic and recurrent) with the parameter $\tau $, characterizing the variation of stationary distributions associated with the acting policy.

###### Assumption 2

There exists a constant ${t}_{\text{\mathit{m}\mathit{i}\mathit{x}}}\mathrm{>}\mathrm{0}$ such that for any stationary policy ${\pi}^{g}$, we have

$${t}_{\text{\mathit{m}\mathit{i}\mathit{x}}}\ge \underset{t}{\mathrm{min}}\{t\mid {\parallel {({P}^{{\pi}^{g}})}^{t}(i,\cdot )-{\bm{\nu}}^{{\pi}^{g}}\parallel}_{TV}\le \frac{1}{4},\forall i\in \mathcal{S}\},$$ |

where $\mathrm{\parallel}\mathrm{\cdot}{\mathrm{\parallel}}_{T\mathit{}V}$ is the total variation.

The above assumption requires the multi-agent AMDP to be sufficiently rapidly mixing with the parameter ${t}_{\text{mix}}$, characterizing how fast the multi-agent AMDP reaches its stationary distribution from any state under any acting policy [39]. In other words, ${t}_{\text{mix}}$ actually characterizes the distance between any stationary policy and the optimal policy under the considered multi-agent AMDP. In the following analysis, we focus on the optimal difference-of-value vector that satisfies ${\parallel {v}^{*}\parallel}_{\mathrm{\infty}}\le 2{t}_{\text{mix}}$, which has been proven to exist under Assumption 2 [39].

Based on Assumption 1 and Assumption 2, we specify the search spaces for the global primal variable $\bm{v}$ and the global dual variable $\bm{\mu}$, respectively, as

$\mathcal{U}$ | $=\{{\parallel {\mu}^{g}\parallel}_{1,1}=1,{\mu}_{i,a}^{g}\ge 0,{\displaystyle \sum _{a\in \mathcal{A}}}{\bm{\mu}}_{a}^{g}\ge {\displaystyle \frac{1}{\sqrt{\tau}|\mathcal{S}|}}\mathrm{\U0001d7cf}\},$ | (12a) | ||

$\mathcal{V}$ | $=\{||\bm{v}|{|}_{\mathrm{\infty}}\le 2{t}_{\text{mix}}^{*}\}.$ | (12b) |

### 4.1 Convergence Analysis

In this section, we establish the convergence result of the proposed Algorithm 2.

###### Theorem 1 (Finite-Iteration Duality Gap)

Let $\mathrm{M}\mathrm{=}\mathrm{(}\mathrm{S}\mathrm{,}\mathrm{A}\mathrm{,}\mathrm{P}\mathrm{,}{\mathrm{\left\{}{\mathrm{R}}_{m}\mathrm{\right\}}}_{m\mathrm{=}\mathrm{1}}^{M}\mathrm{)}$ be an arbitrary multi-agent AMDP tuple satisfying Assumption 1 and Assumption 2. The sequence of iterates generated by Algorithm 2 satisfies

$$\begin{array}{c}\hfill \frac{1}{T}\sum _{t=1}^{T}\mathbb{E}\left[\sum _{a\in \mathcal{A}}{({\bm{v}}^{*}-{P}_{a}{\bm{v}}^{*}-\sum _{m\in \mathcal{M}}{\bm{r}}_{a}^{m})}^{\top}{\mu}_{a}^{g,t}\right]+{\overline{v}}^{*}\le \stackrel{~}{O}\left(\left(4{t}_{\text{\mathit{m}\mathit{i}\mathit{x}}}^{*}+M\right)\sqrt{\frac{|\mathcal{S}||\mathcal{A}|}{T}}\right).\end{array}$$ |

Theorem 1 establishes a sub-linear error bound of the linear complementarity corresponding to problem (4). The result also covers the single-agent RL [39] as a special case, which makes our model more general.

It is noteworthy to point out that in our proof, the scalar $M$ in Theorem 1, i.e., the number of agents, comes from the bound of the total reward of all agents ${\sum}_{m\in \mathcal{M}}{r}^{m}\in (0,M)$. As such, if we consider a normalized reward where ${\sum}_{m\in \mathcal{M}}{r}^{m}\in (0,1)$, the complexity in Theorem 1 will no longer be related to $M$, namely, scale-free over the number of agents. The proofs are deferred to the appendix.

### 4.2 Sample Complexity Analysis

In this section, we analyze the sample complexity of the proposed Algorithm 2. The aim of Algorithm 2 is to obtain the optimal acting policy ${\pi}^{g,*}$ which maximizes the value function ${\overline{v}}^{{\pi}^{g}}$ defined in (2). Denoting the optimal value function as ${\overline{v}}^{*}$ and the value function of the acting policy generated by Algorithm 2 as ${\overline{v}}^{{\widehat{\pi}}^{g}}$, the gap between ${\overline{v}}^{*}$ and ${\overline{v}}^{{\widehat{\pi}}^{g}}$ can be quantified as follows.

###### Theorem 2 (Sample Complexity)

Consider an arbitrary multi-agent AMDP tuple

$\mathrm{M}\mathrm{=}\mathrm{(}\mathrm{S}\mathrm{,}\mathrm{A}\mathrm{,}\mathrm{P}\mathrm{,}{\mathrm{\left\{}{\mathrm{R}}_{m}\mathrm{\right\}}}_{m\mathrm{=}\mathrm{1}}^{M}\mathrm{)}$, which satisfies Assumption 1 and Assumption 2.
The value function of the acting policy ${\widehat{\pi}}^{g}$ generated by running Algorithm 2 for $T$ iterations satisfies

$${\overline{v}}^{*}-\mathbb{E}[{\overline{v}}^{{\widehat{\pi}}^{g}}]=\mathcal{O}\left(\tau \left(4{t}_{\text{\mathit{m}\mathit{i}\mathit{x}}}^{*}+M\right)\sqrt{\frac{|\mathcal{S}||\mathcal{A}|}{T}}\right)$$ |

where ${\overline{v}}^{\mathrm{*}}$ is the value function of the optimal acting policy.

Theorem 2 indicates that the proposed distributed primal-dual algorithm obtains an $\u03f5$-optimal policy with samples of size $\mathcal{O}(\frac{{\tau}^{2}{\left(4{t}_{\text{mix}}^{*}+M\right)}^{2}|\mathcal{S}||\mathcal{A}|}{{\u03f5}^{2}})$. Similar to Theorem 1, the scalar $M$ also comes from the bound of the total reward of all agents. The proofs are deferred to the appendix.

## 5 Numerical Results

This section evaluates the proposed voting-based MARL algorithm with two experiments: 1) verifying the convergence of our proposed algorithm with the generated MDP instances; 2) applying the proposed algorithm to a voting-based multi-agent system in wireless communication. The experiments mainly show that: 1) the distributed decision making does not slow down the global consensus to reach the optimality; 2) the voting-based learning is more efficient than letting agents behave individually and selfishly. Moreover, we add another experiment on a multi-agent queuing system in the appendix.

### 5.1 Empirical Convergence

We generate the instances of multi-agent MDP using a similar setup as in White and White [1]. Specifically, given a state and an action, the multi-agent MDP shifts to the next state assigned from the entire set without replacement. The transition probabilities are generated randomly from $[0,1]$ and the transitions are normalized to sum as one. The optimal policy is generated with purposeful behaviors by letting the agent favor a single action in each state and assigning it with a higher expected reward in $[0,1]$.

In Figure 1, we show the empirical convergence result over one million iterations. The error $\u03f5={\parallel {\overline{v}}^{*}-{\overline{v}}^{\widehat{\pi}}\parallel}_{1}$ is averaged over $100$ instances. The result shows that the empirical convergence rate well supports the result given in Theorem 2. Moreover, we also present: 1) the performance change of varying the number of local agents from $M=5$ to $M=100$, and 2) the performance of centralized learning, which directly uses the global primal-dual updates to learn the global policy. The result shows that the empirical convergence rates of the centralized case and the distributed case are the same over different numbers of agents $M$, which verifies that the distributed decision making does not slow down the global consensus to reach the optimality.

### 5.2 Application to Wireless Communication Systems

We now apply the proposed voting-based MARL algorithm to a multi-agent system in wireless communication. Recently, the unmanned aerial vehicles (UAV) assisted wireless communication has attracted much research attention [29, 14, 25, 6]. However, obtaining the best performance in a unmanned aerial vehicle mounted with a mobile base station (UAV-BS) assisted wireless system highly depends on the optimal placement of the UAV-BSs [29, 14, 25]. We here consider optimizing the placement of multiple UAV-BSs through our proposed voting-based MARL algorithm. More details are deferred to the appendix.

Specifically, the considered single UAV-BS assisted wireless system is shown in Figure 1. The action is to move the UAV-BS to any one of the $|\mathcal{A}|=9$ candidate aerial locations, which is determined by the votes from ground-BSs. The investigated area is regularly divided into nine grids, and the system states are characterized by the load distribution of the grids, including $|\mathcal{S}|=510$ conditions. The reward function is defined to maximize the user throughput. Due to space limitation, we defer the detailed experiment settings, e.g., the wireless channel model, load and reward definitions, to the appendix.

Figure 1 present the rewards averaged over $20$ runs. We compare the proposed voting-based scheme with two baselines: 1) the selfish-action scheme, where each agent is maximizing its own rewards and the MARL system randomly chooses one agent to determine the global action per iteration; 2) the optimal scheme, obtained by assuming the underlying MDP known. The result shows that the proposed voting-based mechanism can well coordinate the underlying agents to approach the optimal global rewards faster than letting all the agents perform selfishly, which justifies the learning effectiveness and the possibility to be applied to practical multi-agent systems.

## 6 Conclusions

In this paper, we considered a collaborative MARL problem, where the agents vote to make group decisions. Specifically, the agents are coordinated following the proposed voting mechanism without revealing their own rewards to each other. We cast the concerned MARL problem into a saddle point formulation and proposed a distributed primal-dual learning algorithm, which could achieve the same sub-linear convergence rate as centralized learning. Finally, we also provided empirical experiments to demonstrate the learning effectiveness.

## References

- [1] A. Adam and M. White. Investigating practical linear temporal difference learning. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 494–502, Singapore, May 2016.
- [2] G. Arslan and S. Yüksel. Decentralized Q-learning for stochastic teams and games. IEEE Transactions on Automatic Control, 62(4):1545–1558, April 2017.
- [3] B. Aygun and A. M. Wyglinski. A voting-based distributed cooperative spectrum sensing strategy for connected vehicles. IEEE Transactions on Vehicular Technology, 66(6):5109–5121, June 2017.
- [4] M. G. Azar, R. Munos, and H. J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, June 2013.
- [5] D. P. Bertsekas. Dynamic programming and optimal control. Athena scientific Belmont, MA, 2005.
- [6] M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong. Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience. IEEE Journal on Selected Areas in Communications, 35(5):1046–1061, May 2017.
- [7] Y. Chen, L. Li, and M. Wang. Scalable bilinear $\pi $ learning using state and action features. In International Conference on Machine Learning (ICML), Stockholm, Sweden, July 2018, to appear.
- [8] Y. Chen and M. Wang. Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516, 2016.
- [9] B. Dai, N. He, Y. Pan, B. Boots, and L. Song. Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579, 2016.
- [10] T. G. Dietterich, M. A. Taleghan, and M. Crowley. PAC optimal planning for invasive species management: Improved exploration for reinforcement learning from simulator-defined MDPs. In AAAI Conference on Artificial Intelligence (AAAI), Bellevue, Washington, July 2013.
- [11] S. S. Du, J. Chen, L. Li, L. Xiao, and D. Zhou. Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning (ICML), pages 1049–1058, Sydney, Australia, August 2017.
- [12] J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 2137–2145, Barcelona, Spain, December 2016.
- [13] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. Torr, P. Kohli, and S. Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), pages 1146–1155, Sydney, Australia, August 2017.
- [14] R. Ghanavi, E. Kalantari, M. Sabbaghian, H. Yanikomeroglu, and A. Yongacoglu. Efficient 3D aerial base station placement considering users mobility by reinforcement learning. In IEEE Wireless Communications and Networking Conference (WCNC), pages 1–6, Barcelona, Spain, April 2018.
- [15] J. K. Gupta, M. Egorov, and M. Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems (AAAMS), pages 66–83, São Paulo, Brazil, May 2017.
- [16] A. Haris, B. Markus, C. Vincent, E. Edith, F. Rupert, and W. Toby. Justified representation in approval-based committee voting. Social Choice and Welfare, 48(1):461–485, February 2014.
- [17] N. Katenka, E. Levina, and G. Michailidis. Local vote decision fusion for target detection in wireless sensor networks. IEEE Transactions on Signal Processing, 56(1):329–338, January 2008.
- [18] M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. Machine learning, 49(2-3):193–208, November 2002.
- [19] M. J Kearns and S. P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 996–1002, Denver, CO, December 1999.
- [20] D. Lee, H. Yoon, and N. Hovakimyan. Primal-dual algorithm for distributed reinforcement learning: distributed GTD. In IEEE Conference on Decision and Control (CDC), pages 1967–1972, Miami Beach, USA, December 2018.
- [21] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), pages 157–163, New Brunswick, NJ, USA, July 1994.
- [22] M. L. Littman. Friend-or-foe Q-learning in general-sum games. In International Conference on Machine Learning (ICML), pages 322–328, Williamstown, MA, USA, July 2001.
- [23] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NeurIPS), pages 6379–6390, Long Beach, CA, USA, December 2017.
- [24] J. Lyu, Y. Zeng, R. Zhang, and T. J. Lim. A survey of mobility models for ad hoc network research. Wireless Communications and Mobile Computing, 2(5):483–502, August 2002.
- [25] J. Lyu, Y. Zeng, R. Zhang, and T. J. Lim. Placement optimization of UAV-mounted mobile base stations. IEEE Commun. Lett., 21(3):604–607, March 2017.
- [26] S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274, May 2015.
- [27] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli. Applications of deep learning and reinforcement learning to biological data. IEEE Transactions on Neural Networks and Learning Systems, 29(6):2063–2079, June 2018.
- [28] A. Mathkar and V. S. Borkar. Distributed reinforcement learning via gossip. IEEE Transactions on Automatic Control, 62(3):1465–1470, March 2017.
- [29] M. Mozaffari, W. Saad, M. Bennis, Y.-H. Nam, and M. Debbah. A tutorial on UAVs for wireless networks: Applications, challenges, and open problems. arXiv preprint arXiv:1803.00680, 2018.
- [30] S. M. Nam and T. H. Cho. Context-aware architecture for probabilistic voting-based filtering scheme in sensor networks. IEEE Transactions on Mobile Computing, 16(10):2751–2763, October 2017.
- [31] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning (ICML), pages 2681–2690, Sydney, NSW, Australia, August 2017.
- [32] L. Panait and S. Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, November 2005.
- [33] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- [34] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, den D. G. Van, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.
- [35] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, October 2017.
- [36] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
- [37] M. A. Taleghan, T. G. Dietterich, M. Crowley, K. Hall, and H. J. Albers. PAC optimal MDP planning with application to invasive species management. The Journal of Machine Learning Research, 16(1):3877–3903, January 2015.
- [38] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong. Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems (NeurIPS), Montréal, Canada, December 2018, to appear.
- [39] M. Wang. Primal-dual $\pi $ learning: Sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100, 2017.
- [40] Z. Wang, L. Li, Y. Xu, H. Tian, and S. Cui. Handover control in wireless systems via asynchronous multi-user deep reinforcement learning. IEEE Internet of Things Journal, 5(6):4296–4307, June 2018.
- [41] Y. Xu, W. Xu, Z. Wang, J. Lin, and S. Cui. Deep reinforcement learning based mobility load balancing under multiple behavior policies. In IEEE International Conference on Communications (ICC), Shanghai, China, May 2019, to appear.
- [42] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438, 2018.
- [43] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning (ICML), Stockholm, Sweden, July 2018, to appear.