### Abstract

Deep Reinforcement Learning has made significant progress in multi-agentsystems in recent years. In this review article, we have mostly focused onrecent papers on Multi-Agent Reinforcement Learning (MARL) than the olderpapers, unless it was necessary. Several ideas and papers are proposed withdifferent notations, and we tried our best to unify them with a single notationand categorize them by their relevance. In particular, we have focused on fivecommon approaches on modeling and solving multi-agent reinforcement learningproblems: (I) independent-learners, (II) fully observable critic, (III) valuefunction decomposition, (IV) consensus, (IV) learn to communicate. Moreover, wediscuss some new emerging research areas in MARL along with the relevant recentpapers. In addition, some of the recent applications of MARL in real world arediscussed. Finally, a list of available environments for MARL research areprovided and the paper is concluded with proposals on the possible researchdirections.

### Quick Read (beta)

# A Review of Cooperative Multi-Agent Deep Reinforcement Learning

###### Abstract

Deep Reinforcement Learning has made significant progress in multi-agent systems in recent years. In this review article, we have mostly focused on recent papers on Multi-Agent Reinforcement Learning (MARL) than the older papers, unless it was necessary. Several ideas and papers are proposed with different notations, and we tried our best to unify them with a single notation and categorize them by their relevance. In particular, we have focused on five common approaches on modeling and solving multi-agent reinforcement learning problems: (I) independent-learners, (II) fully observable critic, (III) value function decomposition, (IV) consensus, (IV) learn to communicate. Moreover, we discuss some new emerging research areas in MARL along with the relevant recent papers. In addition, some of the recent applications of MARL in real world are discussed. Finally, a list of available environments for MARL research are provided and the paper is concluded with proposals on the possible research directions.

A Review of Cooperative Multi-Agent Deep Reinforcement Learning

Afshin OroojlooyJadid and Davood Hajinezhad
{afshin.oroojlooy, davood.hajinezhad}@sas.com
SAS Institute Inc., Cary, NC, USA

## 1 Introduction

Multi-Agent Reinforcement Learning (MARL) is a system of agents (robots, machines, cars, etc.) which are interacting within a common environment. Each agent makes a decision in each time-step and works along with the other agent(s) to achieve a given goal. Basically, the agents are learnable units which want to learn policy on the fly in order to maximize the long-term reward through the interaction with the environment. Due to the complexities of the environments and the combinatorial nature of the problem, training the agents is a challenging problem and most MARL problems are categorized as NP-Hard problems [bernstein2002complexity].

With the motivation of recent success on deep reinforcement learning (RL)—super-human level control on Atari games [mnih2015human], mastering the game of Go [silver2016mastering], chess [silver2017mastering], robotic [kober2013reinforcement], health care planning [liu2017deep], power grid [glavic2017reinforcement], routing [nazari2018reinforcement], and inventory optimization [oroojlooyjadid2017deep]—several researches have been focused on deep MARL. One naive approach to train a MARL problem could be training all the agents using a centralized controller, which in fact turns to a single agent RL problem to obtain the joint action of all agents to execute in each time step. However, in this approach, the number of actions exponentially increases, which makes the problem intractable. Besides, each agent needs to send its local information to the central controller and with increasing the number of agents, this approach becomes very expensive or impossible. In addition to the communication cost, this approach is vulnerable to the presence of the central unit and any incident for that results in the loss of the network. Moreover, usually in MARL problems, each agent accesses only to some local information and due to privacy issues they may not be allowed to share their information with the other agents.

There are several properties of the system that is important in modeling a MARL system such as (i) centralized or decentralized control, (ii) fully or partially observable environment, (iii) cooperative or competitive environment. Within a centralized controller, a central unit takes the decision for each agent in each time step. On the other hand, in the decentralized system, each agent takes a decision for itself. Also, the agents might cooperate to achieve a common goal, e.g. a group of robots who want to identify a source or they might compete with each other to maximize their own reward, e.g. the players in different teams of a game. In each of these cases, the agent might be able to access the whole information and the sensory observation (if any) of the other agents, or on the other hand, each agent might be able to observe only its local information. In this paper, we have focused on the decentralized models with the cooperative goal, and most of the relevant papers with either full or partial observability are reviewed. Note that, [matignon2012independent, bucsoniu2010multi, bu2008comprehensive] provide reviews on cooperative games and general MARL papers published till 2012. Also, [da2019survey] provides a survey over the utilization of transfer learning in MARL. We mostly focused on the recent works on the cooperative Deep MARL. Furthermore, MARL problems usually include large state/action spaces. Therefor, the classical tabular RL algorithm are not efficient to solve them, and we mostly focus on the approximated RL algorithms.

The rest of the paper is organized as the following: in Section 1.1 we briefly explain the single agent RL problem and some of its components. Then the multi-agent formulation is represented and some of the main challenges of MARL environment is described in Section 1.2. Section 2 explains the IQL-type algorithm, Section 3 reviews the papers with a fully observable critic model, Section 4 includes the value decomposition papers, Section 5 explains the consensus approach, Section 6 reviews the learn-to-communicate approach, Section 7 explains some of the emerging research directions, Section 8 provides some applications of the MARL models in real-world, Section 9 very briefly mentions some of the available multi-agent environments, and finally Section 10 concludes the paper.

### 1.1 Single-Agent RL Formulation

RL considers a sequential decision making problem in which an agent interacts with an environment. The agent at time period $t$ observes state ${s}_{t}\in \mathcal{S}$ in which $\mathcal{S}$ is the state space, takes action ${a}_{t}\in \mathcal{A}({s}_{t})$ where $\mathcal{A}({s}_{t})$ is the valid action space for state ${s}_{t}$, and executes it in the environment to receive reward ${r}_{t}$ and transfer to the new state ${s}_{t+1}\in \mathcal{S}$. The goal of the agent is to determine a policy that maximizes the long-term reward. The value function starting from state $s$ and following policy $\pi $ denoted by ${V}^{\pi}(s)$ is given by

${V}^{\pi}(s)={\mathbb{E}}_{\pi}[{\displaystyle \sum _{t=1}^{\mathrm{\infty}}}{\gamma}^{t-1}{r}_{t+1}|{s}_{0}=s],$ | (1) |

where, $\gamma \in [0,1]$ is the discounting factor. This problem is known as the Markov Decision Process (MDP) and provides a framework to study problems in a given environment. Within a given state transition probability distribution $p({s}_{t+1}|{s}_{t},{a}_{t})$ and reward matrix $r({s}_{t},{a}_{t})$, Bellman [bellman1957markovian] showed that the following equation holds for all state ${s}_{t}$ in any time step $t$:

$${V}^{\pi}({s}_{t})=\sum _{a\in \mathcal{A}({s}_{t})}\pi (a|{s}_{t})\sum _{{s}^{\prime}\in \mathcal{S}}p({s}^{\prime}|{s}_{t},a)\left[r({s}_{t},a)+\gamma {V}^{\pi}({s}^{\prime})\right],$$ | (2) |

where here and in the rest of the paper we use ${s}^{\prime}$ to denote the successor state of $s$.

Instead of averaging, by maximizing over the actions the optimal state-value and optimal policy can be obtained by:

$${V}^{{\pi}^{\ast}}({s}_{t})=\underset{a}{\mathrm{max}}\sum _{{s}^{\prime}}p({s}^{\prime}|{s}_{t},a)\left[r({s}_{t},a)+\gamma {V}^{{\pi}^{\ast}}({s}^{\prime})\right].$$ | (3) |

Similarly, the optimal Q-value for each state-action can be obtained by:

$${Q}^{{\pi}^{\ast}}({s}_{t},{a}_{t})=\sum _{{s}^{\prime}}p({s}^{\prime}|{s}_{t},{a}_{t})\left[r({s}_{t},{a}_{t})+\gamma \underset{{a}^{\prime}}{\mathrm{max}}{Q}^{{\pi}^{\ast}}({s}^{\prime},{a}^{\prime})\right].$$ | (4) |

One can obtain an optimal policy ${\pi}^{*}$ utilizing the above equations. The relevant methods are called value-based methods. However, in the real world usually $p({s}_{t+1}|{s}_{t},{a}_{t})$ is not available and one cannot obtain the optimal policy using (3), or (4). In order to address this issue approximating the state-value, the Q-value, or the policy has been a common practice. Therefore, a function is used to approximate the state-value (called value-function) or the policy (called policy-function) for any possible state. Similarly, Q-value is approximated by a Q-function. The function approximator with parameters $\theta $ can be a simple linear regression model or a deep neural network. The goal is to maximize the utility function,

$$J(\theta )={\mathbb{E}}_{\theta ,s}\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}r({s}_{t},{a}_{t};\theta ).$$ | (5) |

There are two main approaches to solve this problem. In the first approach, a value function or state/value is being learned to approximate the expected sum of the discounted long-term reward. Given these approximations, the optimal policy can be achieved. In the second approach, the policy is directly learned which determines the probability of choosing an action for a given state. We explain a brief overview of the first and second approaches in Sections 1.1.1 and 1.1.2, respectively. In either of these approaches, the distribution of $\theta $ and $s$ are unknown, so that the expectation is approximated by averaging over samples from some trajectories. Note that there is a large body of literature on the single-agent RL algorithms, and we only reviewed those algorithms which are actively used in multi-agent literature too. So, we did not explain advanced algorithms like TRPO [schulman2015trust] and PPO [schulman2017proximal] which are not so common in MARL literature. For more details about other single-agent RL algorithms see [li2017deep, sutton2018reinforcement].

#### 1.1.1 Value Approximation

Within the value approximation, a function to estimate $V(s)$ or $Q(s,a)$ is learned. It is showed that with large enough number of observation, a linear approximation converges to a local optimal [bertsekas96, sutton2000policy]. Nevertheless, the linear approximators are not powerful enough to capture all complexities of a complex environment. To address this issue, there has been a tremendous amount of research on non-linear approximators and especially neural networks in recent years [mnih2013playing, bhatnagar2009convergent, gao2019reduced]. Among the recent works on the value-based methods, deep Q-network (DQN) algorithm [mnih2015human] got a lot of attention due to its supper human-level performance on the Atari-2600 games [bellemare2013arcade], in which it only observes the video of the game. DQN utilizes the replay memory buffer [lin1992self] and a moving target network to stabilize the training. The replay memory holds the previous observation tuple $({s}_{t},{a}_{t},{r}_{t},{s}_{t+1},{d}_{t})$ in which ${d}_{t}$ determines if the episode ended with this observation. Then the approximator is trained by taking a random mini-batch from the replay memory. DQN algorithm uses the convolutional neural network (CNN) to the state ${s}_{t}$ as the input and then pass the flatted output of the CNN to a fully connected neural network such that it includes $|A|$ outputs to approximate the Q-value for each possible action. This chain of the neural networks with weights $\theta $ is then trained by taking a mini-batch of size $m$ to minimize loss function:

$L(\theta )=$ | $\frac{1}{m}}{\displaystyle \sum _{i=1}^{m}}{\left({y}_{i}-Q({s}_{i},{a}_{i};\theta )\right)}^{2$ | (6) | ||

${y}_{i}=$ | $\{\begin{array}{cc}{r}_{i},\hfill & {d}_{t}=\text{True},\hfill \\ {r}_{i}+\gamma {\mathrm{max}}_{{a}^{\prime}}Q({s}_{i}^{\prime},{a}^{\prime},{\theta}^{-})\hfill & {d}_{t}=\text{False},\hfill \end{array}$ | (7) |

where, ${\theta}^{-}$ is the weights of the *target network* which is updated by $\theta $ every $C$ iterations.
Later, new DQN-based approaches proposed for solving RL problems. For example, inspired by [hasselt2010double] which proposed double Q-learning, the Double-DQN algorithm proposed in [van2016deep]. Similarly, another heuristic formula to obtain ${y}_{i}$ is proposed as Dueling DQN [wang2015dueling].
In addition, another extension of theDQN algorithm is proposed by combining recurrent neural network with DQN. DRQN [hausknecht2015deep] is a DQN algorithm which uses an LSTM instead of a feed-forward network and is applied to a partially observable environment.
In all these algorithms, usually, the $\u03f5$-Greedy algorithm is used to ensure the exploration. That is, in each time-step with a probability of $\u03f5$ the action is chosen randomly and otherwise it is selected greedily by getting an argmax over the Q-values for the state. Typically, the value of $\u03f5$ is annealed during the training.
To see other extensions of the value-based algorithm and other exploration algorithms, see [li2017deep].

#### 1.1.2 Policy Approximation

In the value-based methods, the key idea is learning the optimal value-function or the Q-function, and from there a greedy policy could be obtained. In another direction, one can parametrize the policy, make a merit function, and try to optimize this function over the policy parameter through a training process. This class of RL algorithms is called policy-based methods. For example, in the policy gradient method, a stochastic policy by parameters $\theta \in {\mathbb{R}}^{d}$ is learned. First, define

$$h({s}_{t},{a}_{t};\theta )={\theta}^{T}\varphi ({s}_{t},{a}_{t}),$$ |

where, $\varphi ({s}_{t},{a}_{t})\in {\mathbb{R}}^{d}$ is called the feature vector of the state-action pair $(s,a)$. Then, the stochastic policy can be obtained by:

$$\pi (a|s;\theta )=\frac{{e}^{h({s}_{t},{a}_{t};\theta )}}{{\sum}_{b}{e}^{h({s}_{t},b;\theta )}},$$ |

which is the softmax function. The goal is to directly learn the parameter $\theta $ using a gradient-based algorithm. Let us $J(\theta )$ measures the value function for the policy ${\pi}_{\theta}$ starting from ${s}_{0}$. Then the policy gradient theorem provides an analytical expression for $J(\theta )$ as the following:

$${\nabla}_{\theta}J(\theta )={\mathbb{E}}_{{\pi}_{\theta}}\left[{\nabla}_{\theta}\mathrm{log}\pi (a|s;\theta ){Q}_{{\pi}_{\theta}}(s,a)\right],$$ | (8) |

Then the policy-based methods update the parameter $\theta $ as below:

${\theta}^{t+1}={\theta}^{t}+\alpha \widehat{{\nabla}_{\theta}J(\theta )},$ | (9) |

where, $\widehat{{\nabla}_{\theta}J(\theta )}$ is an approximation of the true gradient, and $\alpha $ is the learning rate.

There are two main benefits of the policy-gradient method over value-based methods [sutton2018reinforcement]: (i) By following the policy gradient algorithm, there is a chance of converging to a deterministic policy, though with a value-based algorithm, there is always a $\u03f5\%$ of taking a random action. (ii) With the value-based algorithms there is no way of learning a stochastic policy. This is important when the optimal policy is stochastic which might be the case in competitive games. Considering those benefits, numerous policy gradient methods are proposed in the literature and depending on the gradient approximation strategy, they have different applications and effectiveness. We briefly review some of the most important ones which we recall them frequently in the next sections. For a review on other policy gradient-based algorithms see [sutton2000policy, li2017deep].

REINFORCE algorithm is one of the first successful policy-gradient algorithms which proposed in 2000 [sutton2000policy]. Particularly, REINFORCE applies Monte Carlo method and uses the actual return $G:={\sum}_{t=0}^{T}{\gamma}^{t}r({s}_{t},{a}_{t})$ (T is the end of the episode) as an approximation for the ${Q}_{{\pi}_{\theta}}(s,a)$ in equation (8).

The Actor-Critic (AC) model extends REINFORCE algorithm by adding another approximator, called critic, to learn the Q-values. The critic receives state ${s}_{t}$ and returns the approximation of the ${Q}_{w}({s}_{t},{a}_{t})$. The critic is trained by calculating the TD-error ${\delta}_{t}=r({s}_{t},{a}_{t})+\gamma {Q}_{w}({s}_{t+1},{a}_{t+1})-{Q}_{w}({s}_{t},{a}_{t})$, and updating $w$ by $w=w+{\alpha}_{w}{\delta}_{t}{\nabla}_{w}{Q}_{w}({s}_{t},{a}_{t})$, in which ${\alpha}_{w}$ is the critic’s learning rate. A widely useful extension of REINFORCE and actor-critic algorithm is to subtract a baseline from the return (or the approximated return) to reduce the approximation variance. In this order, advantage function $A({s}_{t},{a}_{t})$ can be used instead of the Q-value in which value function $V(s)$ is used as the baseline. Thus, the critic learns the value-function $V(s)$ and then the advantage $A({s}_{t},{a}_{t})$ is achieved by $r({s}_{t},{a}_{t})+\gamma V({s}_{t+1})-V({s}_{t})$. Following this technique, several AC-based methods were proposed. A3C proposed in [mnih2016asynchronous] uses the advantage function and runs several instances of the actor-critic model with the advantage function and asynchronously gathers the gradients and updates the weights of a master node. Afterward, the master node broadcast the new weights to the sender node and in this way all nodes are updated asynchronously. A2C [a2c] uses the same framework, but updates the weights in synchronous manner.

### 1.2 Multi-Agent RL Notations and Formulation

We denote a multi-agent setting with tuple $$, in which $N$ is the number of agents, $\mathcal{S}$ is state space, $\mathcal{A}=\{{\mathcal{A}}_{1},\mathrm{\dots},{\mathcal{A}}_{N}\}$ is the set of actions for all agents, $P$ is the transition probability among the states, $R$ is the reward function, and $\mathcal{O}=\{{\mathcal{O}}_{1},\mathrm{\dots},{\mathcal{O}}_{N}\}$ is the set of observations for all agents. Within any type of the environment, we use $\bm{a}$ to denote the vector of actions for all agents, ${a}_{-i}$ is the vector of all agents except agent $i$, ${\tau}_{i}$ represents observation-action history of agent $i$, and $\bm{\tau}$ is the observation-action of all agents. Also, $\mathcal{T}$, $\mathcal{S}$, and $\mathcal{A}$ are the observation-action space, state space, and action space. Then, in a cooperative problem with $N$ agents with full observability of the environment, each agent $i$ at time-step $t$ observes the global state ${s}_{t}$ and uses the local stochastic policy ${\pi}_{i}$ to take action ${a}_{i}^{t}$ and then receives reward ${r}_{i}^{t}$. If the environment is fully cooperative, at each time step all agents observe a joint reward value ${r}_{t}$, i.e., ${r}_{1}^{t}=\mathrm{\cdots}={r}_{N}^{t}={r}^{t}$. If the agents are not able to fully observe the state of the system, each agent only accesses its own local observation ${o}_{i}^{t}$.

Similar to the single-agent case, each agent is able to learn the optimal Q-value or the optimal stochastic policy. However, since the policy of each agent changes as the training progresses; Therefore, the environment becomes non-stationary from the perspective of any individual agent. Basically, $P({s}^{\prime}|s,{a}_{i},{\pi}_{1},\mathrm{\dots},{\pi}_{N})\ne P({s}^{\prime}|s,{a}_{i},{\pi}_{1}^{\prime},\mathrm{\dots},{\pi}_{N}^{\prime})$ when any ${\pi}_{i}\ne {\pi}_{i}^{\prime}$ so that we lose the underlying assumption of MDP. This means that the experience of each agent involves different co-player policies, so we cannot fix them and train an agent such that any attempt to train such models results in fluctuations of the training. This makes the model training quite challenging. Therefore, the adopted Bellman equation for MARL [foerster2017stabilising] (assuming the full observability) also does not hold for the multi-agent system:

$${Q}_{i}^{\ast}(s,{a}_{i}|{\bm{\pi}}_{-i})=\sum _{{\bm{a}}_{-i}}{\bm{\pi}}_{-i}({\bm{a}}_{-i},s)\left[r(s,{a}_{i},{\bm{a}}_{-i})+\gamma \sum _{{s}^{\prime}}P({s}^{\prime}|s,{u}_{i},{\bm{a}}_{-i})\underset{{a}_{i}^{{}^{\prime}}}{\mathrm{max}}{Q}_{i}^{\ast}(s,{a}_{i}^{{}^{\prime}})\right],$$ | (10) |

where ${\bm{\pi}}_{-i}={\mathrm{\Pi}}_{j\ne i}{\pi}_{j}({a}_{j}|s)$. Due to the fact that ${\bm{\pi}}_{-i}$ changes over time as the policy of other agents changes, in MARL one cannot obtain the optimal Q-value using the classic Bellman equation.

On the other hand, since the policy of each agent changes during the training; so, one cannot use experience replay without dealing with the non-stationarity. Without experience replay, the DQN algorithm [mnih2015human] and its extensions or basically neither of the approximated value-based algorithms can be directly used. The same issue exists within AC-based algorithms which use a DQN-like algorithm for the critic. Besides, in most problems in MARL agents are not able to observe the full state of the system, which are categorized as decentralized POMDP (Dec-POMDP). Due to the partial observability and the non-stationarity of the local observations, Dec-POMDPs are even harder problems to solve and it can be shown that they are in the class of NEXP-complete problems [bernstein2002complexity]. A similar equation to (10) can be obtained for the partially observable environment too.

In the Multi-agent RL, the noise and variance of the rewards increase which results in the instability of the training. The reason is that the reward of one agent depends on the actions of other agents, and the conditioned reward on the action of a single agent can exhibit much more noise and variability than a single agent’s reward. Therefore, training a policy gradient algorithm also would not be effective in general.

Finally, we define the following notation which is used in a couple of papers in Nash equilibrium. A joint policy ${\pi}^{\ast}$ defines a Nash equilibrium if and only if:

$$\forall {\pi}_{i}\in {\mathrm{\Pi}}_{i},\forall s\in \mathcal{S},{J}_{i}({\pi}_{i}^{\ast},{\pi}_{-i}^{\ast})\ge {J}_{i}({\pi}_{i},{\pi}_{-i}^{\ast});\forall i\in \{1,\mathrm{\dots},N\}$$ | (11) |

in which ${J}_{i}(s)$ is the expected long-term return of agent $i$ in state $s$ and ${\mathrm{\Pi}}_{i}$ is the set of all possible policies for agent $i$. Basically, it means that each agent prefers to not change its policy, if wants to attain the long-term reward. Further, if the following holds for policy $\widehat{\pi}$:

$${J}_{i}(\widehat{\pi})\ge {J}_{i}({\pi}_{i}),\forall i\in \{1,\mathrm{\dots},N\},\forall {\pi}_{i}\in {\mathrm{\Pi}}_{i},\forall s\in \mathcal{S},$$ |

policy $\widehat{\pi}$ is called Pareto-optimal.

## 2 IQL and Extensions

The most naive approach to solve the multi-agent RL problem is to treat each agent independently such that it considers the rest of the agents as part of the environment. This idea is formalized in independent Q-Learning (IQL) algorithm [tan1993multi], in which each agent accesses its local observation and the overall agents try to maximize a joint reward. Each agent runs a separate Q-learning algorithm [watkins1992q] (or it can be the newer extensions like DQN [mnih2015human], DRQN [hausknecht2015deep], etc.). IQL is an appealing algorithm since (i) it does not have the scalability problem with increasing the number of agents, (ii) each agent only needs its local history of observations during the training and the inference time. The tabular IQL usually works well in practice for small size problems [matignon2012independent, matignon2012independent]; however, in the case of function approximation, especially deep neural network (DNN) it fails. One of the main reasons for this failure is the need for the replay memory to stabilize the training with DNNs [foerster2017stabilising]. In an extension of IQL, Distributed Q-learning [lauer2000algorithm] considers a decentralized fully cooperative multi-agent problem such that all agents observe the full state of the system and do not know the actions of the other agents, although in the training time it assumes the joint action is available for all agents. The joint action is executed in the environment and it returns the joint reward that each agent receives. This algorithm updates the Q-values only when there is a guaranteed improvement, assuming that the low returns are the result of a bad exploration of the teammates. In other words, it maximizes over the possible actions for agent $i$, assuming other agents selected the local optimal action, i.e., for a given joint action ${\bm{a}}_{t}=({a}_{1}^{t},\mathrm{\dots},{a}_{N}^{t})$, it updates the Q-values of agent $i$ by:

$${q}_{i}^{t+1}(s,a)=\{\begin{array}{cc}{q}_{i}^{t}(s,a)\hfill & \text{if}s\ne {s}_{t}\text{or}a\ne {a}_{t},\hfill \\ \mathrm{max}\{{q}_{i}^{t}(s,a),r({s}_{t},{\bm{a}}_{t})+\gamma {\mathrm{max}}_{{a}^{\prime}\in A}{q}_{i}^{t}(\delta ({s}_{t},{\bm{a}}_{t}),{a}^{\prime})\}\hfill & \text{otherwise},\hfill \end{array}$$ | (12) |

in which ${q}_{i}^{t}(s,a)={\mathrm{max}}_{\bm{a}=\{{a}^{1},\mathrm{\dots},{a}^{N}\},{a}^{j}=a}Q(s,\bm{a})$ and $\delta ({s}_{t},{\bm{a}}_{t})$ is the environment function which results in ${s}_{t+1}$. Therefore, Distributed Q-learning completely ignores the low rewards. Thus, it results in an overestimated Q-values. This issue and the course of dimensionality results in poor performance in the problems with high dimension.

Hysteretic Q-learning [matignon2007hysteretic] considers the same problem and tries to obtain a good policy assuming that the low return might be the result of stochasticity in the environment so that it does not ignore them as Distributed Q-Learning does. In particular, when the TD-error is positive, it updates the Q-values by the learning rate $\alpha $, and otherwise, it updates the Q-values by the learning rate $$. Thus, the model is also robust to negative learning due to the team-mate explorations. [bowling2002multiagent] also propose to use a variable learning rate to improve the performance of tabular IQL. In another extension for the IQL, [fuji2018deep] propose to train one of the agents at each time and fix the policy of other agents within a periodic manner in order to stabilize the environment. So, during the training, other agents do not change their policies and the environment from the view-point of the single agent is stationary.

DQN algorithm [mnih2015human] utilized *reply memory* and *target network* and was able to attain super-human level control on most of the Atari games. The classical IQL uses the tabular version, so one naive idea could be using DQN algorithm instead of each single Q-learner. [tampuu2017multiagent] implemented this idea and was one of the first papers which took the benefit of the neural network as a general powerful approximator in IQL-like setting. Specifically, this paper analyzes the performance of the DQN in a decentralized two-agent game for both competitive and cooperative settings. They assume that each agent observes the full state (the video of the game), takes an action by its own policy and the reward values are also known to both agents. The paper is mainly built on the *Pong* game (from Atari-2600 environment [bellemare2013arcade]) in which by changing the reward function the competitive and cooperative behaviors are obtained. In the competitive version, each agent that drops the ball loses a reward-point, and the opponent wins the reward-point so that it is a zero-sum game. In the cooperative setting, once either of agents drops the ball, both agents lose a reward-point. The numerical results show that in both cases the agents are able to learn how to play the game very efficiently, that is in the cooperative setting, they learn to keep the ball for long periods, and in the competitive setting the agents learn to quickly beat the competitor.

Replay memory is one of the core elements in the DQN algorithm. It helps to stabilize the training of the neural network and improves the sample efficiency of the history of observations. However, due to the non-stationarity of the environment, using the experience replay in a multi-agent environment is problematic. Basically, the policy that generates the data for the replay memory is different than the current policy so that the learned policy of each agent can be misleading. In order to address this issue, [foerster2016learning] disables the replay memory part of the algorithm, or in [leibo2017multi] the old transitions are discarded and the replay memory uses only the recent experiences. Even though these approaches help to reduce the non-stationarity of the environment, but both limit the sample efficiency. To resolve this problem, [foerster2017stabilising] propose two algorithms to stabilize the reply memory in IQL-type algorithms. They consider a fully cooperative MARL with local observation-action. In the first approach, each transition is augmented with the probability of choosing the joint action. Then, during the loss-calculation, the importance sampling correction is calculated using the current policy. Thus, the loss function is changed to:

$$L({\theta}_{i})=\sum _{k=1}^{b}\frac{{\bm{\pi}}_{-i}^{{t}_{c}}({\bm{a}}_{-i},s)}{{\bm{\pi}}_{-i}^{{t}_{i}}({\bm{a}}_{-i},s)}\left[{({y}_{i}^{DQN}-Q(s,{a}_{i};{\theta}_{i}))}^{2}\right],$$ | (13) |

in which ${\theta}_{i}$ is the policy parameters for agent $i$, ${t}_{c}$ is the current time-step, and ${t}_{i}$ is the time of collecting ${i}^{\text{th}}$ sample. In this way, the effect of the transitions generated from dissimilar policies is regularized on gradients. In the second algorithm, named FingerPrint, they propose augmenting the replay memory with some parts of the policies of the other agents. However, the number of parameters in DNN is usually large and as a result, it is intractable in practice. Thus, they propose to augment each instance in the replay memory by the iteration number $e$ and the $\u03f5$ of the $\u03f5$-greedy algorithm. In the numerical experiments, they share the weights among the agent, while the id of each agent is also available as the input. They provide the results of two proposed algorithms plus the combination of them on the StarCraft game [samvelyan19smac] and compare the results by a classic experience replay and one no-experience replay algorithm. They conclude that the second algorithm obtains better results compared to the other algorithm.

[omidshafiei2017deep] proposes another extension of the replay memory for the MARL. They consider multi-task cooperative games, with independent partially observable learners such that each agent only knows its own action, with a joint reward. An algorithm, called HDRQN, is proposed which is based on the DRQN algorithm [hausknecht2015deep] and the Hysteretic Q-learning [matignon2007hysteretic]. Also, to alleviate the non-stationarity of MARL, the idea of Concurrent Experience Replay Trajectories (CERTs) is proposed, in which the experience replay memory gathers the experiences of all the agents in any period of one episode and also during the sampling of a mini-batch, it obtains the experiences of one period of all agents together. Since they use LSTM [hochreiter1997long], the experiences in the replay memory are zero-padded (adds zero to the end of the experiments with smaller sizes to make the size of all experiments equal). Moreover, in the multi-task version of HRDQN, there are different tasks that each has its own transition probability, observation, and reward function. During the training, each agent observes the task ID, while it is not accessible in the inference time. To evaluate the model, a two-player game is utilized, in which agents are rewarded only when all the agents simultaneously capture the moving target. In order to make the game partially observable, a flickering screen is used such that with $30\%$ chance the screen is flickering. The actions of the agents are moving north, south, west, east, or waiting. Additionally, actions are noisy, i.e. with $10\%$ probability the agent might act differently than what it wanted.

## 3 Fully Observable Critic

Non-stationarity of the environment is the main issue in MARL problems. One of the common approaches to address this issue is using a fully observable critic. The fully observable critic involves the observation and actions of all agents and as a result the environment is stationary even though the policy of other agents changes. In other words, $P({s}^{\prime}|s,{a}_{1},\mathrm{\dots},{a}_{N},{\pi}_{1},\mathrm{\dots},{\pi}_{N})=P({s}^{\prime}|s,{a}_{1},\mathrm{\dots},{a}_{N},{\pi}_{1}^{\prime},\mathrm{\dots},{\pi}_{N}^{\prime})$ even if ${\pi}_{i}\ne {\pi}_{i}^{\prime}$, since the environment returns an equal next-state regardless of the changes in the policy of other agents. So, once the critic is fully observable, the non-stationarity is resolved and it can be used as a good leader for local actors. Using this idea, [lowe2017multi] proposes a model-free multi-agent reinforcement learning algorithm to the problem in which agent $i$ at time step $t$ of execution accesses its own local observation ${o}_{i}^{t}$, local actions ${a}_{i}^{t}$, and local rewards ${r}_{i}^{t}$. They consider cooperative, competitive, and mixed competitive and cooperative games, and proposed Multi-agent DDPG (MADDPG) algorithm in which each agent trains a DDPG algorithm such that the actor ${\pi}_{i}({o}_{i};{\theta}_{i})$ with policy weights ${\theta}_{i}$ observes the local observations while the critic ${Q}_{i}^{\mu}$ is allowed to access the observations, actions, and the target policies of all agents in the training time. Then, each critic concatenates all states-actions together as the input and using the local reward obtains the corresponding Q-value. The critic is trained by minimizing a DQN-like loss function:

$L({\mu}_{i})={\mathbb{E}}_{{\bm{x}}^{t},a,r,{\bm{x}}^{t+1}}\left[{\left({Q}_{i}({\bm{x}}^{t},{a}_{1}^{t},\mathrm{\dots},{a}_{N}^{t};{\mu}_{i})-y\right)}^{2}\right],$ | ||

$y={r}_{i}^{t}+{\gamma {Q}_{i}({\bm{x}}^{t},{\overline{a}}_{1}^{t+1},\mathrm{\dots},{\overline{a}}_{N}^{t+1};{\overline{\mu}}_{i})|}_{{\overline{a}}_{j}^{t+1}=\overline{\pi}({o}_{j}^{t+1})},$ |

in which ${\overline{\pi}}_{j}$ is the target policy and $\overline{\mu}$ is the target critic. As a result, the *critic of each agent deals with a stationary environment*, and in the inference time, it only needs to access the local information. MADDPG is compared with the decentralized trained version of DDPG [lillicrap2015continuous], DQN [mnih2015human], REINFORCE [sutton2000policy], and TRPO [schulman2015trust] algorithm in a set of grounded communication environments from particle environment [haarnoja2018soft], e.g., predator-prey, arrival task, etc. The continues space-action predator-prey environment from this environment is usually considered as a benchmark for MARL problem with local observation and cooperative rewards. In the most basic version of the predator-prey, there are two predators which are randomly placed in a $5\times 5$ grid, along with one prey which also randomly is located in the grid. Each predator observes its direct neighbor cells, i.e., $3\times 3$ cells, and the goal is to catch the prey together to receive a reward, and in all other situations, each predator obtains a negative reward.

Several extensions of MADDPG algorithm are proposed in the literature and we review some of them in the rest of this section. [ryu2018multi] proposes an actor-critic model with local actor and critic for a DEC-POMDP problem, in which each agent observes a local observation ${o}_{i}$, observes its own reward ${r}_{i}:\mathcal{S}\times {\mathcal{A}}_{1}\times \mathrm{\cdots}\times {\mathcal{A}}_{N}\to \mathbb{R}$, and learns a deterministic policy ${\mu}_{{\theta}_{i}}:{\mathcal{O}}_{i}\to {\mathcal{A}}_{i}$. The goal for agent $i$ is to maximize its own discounted return ${R}_{i}={\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{r}_{i}^{t}$. An extension of MADDPG with generative cooperative policy network, called MADDPG-GCPN, is proposed in which there is an extra actor network ${\mu}_{i}^{c}$ to generate action samples of other agents. Then, the critic uses the replay buffer filled with sampled actions from GCPN, and not from the actor of the other agents. So, there is no need to share the target policy of other agents during the training time. Further, the algorithm modified in a way such that the critic can use either immediate individual or joint reward during the training. They presented a new version of the predator-prey game in which each agent receives an individual reward plus a shared one if they catch the prey. The experimental analysis on the predator-prey game and a controlling energy storage systems problem shows that the standard deviation of obtained Q-values is lower in MADDPG-GCPN compared to MADDPG.

As another extension of MADDPG, [chu2017parameter] considers multi-agent cooperative problems with $N$ agents and proposes three actor-critic algorithms based on MADDPG. The first one assumes all the agents know the global reward and shares the weights between agents so that it actually includes one actor and one critic network. The second algorithm assumes the global reward is not shared each agent updates its own critic using the local reward so that there are $N$ critic networks. Although, agents share their weights so that there is only one actor network and in sum $N+1$ networks are trained. The third algorithm also assumes non-shared global reward, though uses only two networks, one actor network and one critic network such that the critic has $N$ heads in which head $i\in \{1,\mathrm{\dots},N\}$ provides the Q-value of the agent $i$. They compare the results of their algorithms on three new games with MADDPG, PPO [schulman2017proximal], and PS-TRPO algorithms (PS-TRPO is the TRPO algorithm [schulman2015trust] with parameter sharing, see [sukthankar2017autonomous] for more details).

[mao2018modelling] presents another algorithm based on MADDPG for a cooperative game, called ATT-MADDPG which considers the same setting as in [lowe2017multi]. It enhances the MADDPG algorithm by adding an attention layer in the critic network. In the model, each agent trains a critic which accesses the actions and observations of all agents. To obtain the Q-value, an attention layer is added on the top of the critic model to determine the corresponding Q-value. In this way, at agent $i$, instead of just using $[{o}_{1}^{t},\mathrm{\dots},{o}_{N}^{t}]$ and $[{a}_{i}^{t},{a}_{-i}^{t}]$ for time step $t$, ATT-MADDPG considers $K$ combinations of possible action-vector ${a}_{-i}^{t}$, and obtains the corresponding $K$ Q-values. Also, using an attention model, it obtains the weights of all $K$ action-sets such that the hidden vector ${h}_{i}^{t}$ of the attention model is generated via the actions of other agents (${a}_{-i}^{t}$). Then, the attention weights are used to obtain the final Q-value by a weighted sum of the $K$ possible Q-values. Indeed, their model combines the MADDPG algorithm with k-head Q-value model [van2017hybrid]. They provide some numerical experiments on cooperative navigation, predator-prey, a packet-routing problem, and compare the performance with MADDPG and few other algorithms. Moreover, the effect of small or large $K$ is analyzed within each environment. In [wang2019r] again the problem setting is considered as the MADDPG, though they assume a limit on the communication bandwidth. Due to this limitation, the agents are not able to share all the information, e.g., they cannot share their locations on "arrival task" (from Multi-Agent Particle Environment [mordatch2017emergence]), which limits the ability of the MADDPG algorithm to solve the problem. To address this issue, they propose R-MADDPG, in which a recurrent neural network is used to remember the last communication in both actor and critic. In this order, they modified the replay memory such that each tuple includes (${o}_{i}^{t},{a}_{i}^{t},{o}_{i}^{t+1},{r}_{i}^{t},{h}_{i}^{t},{h}_{i}^{t+1}$), in which ${h}_{t}^{i}$ is the hidden state of the actor network. The results are compared with MADDPG over the "arrival task" with communication limit, where each agent can select either to send a message or not, and the message is simply the position of the agent. With the fully observable state, their algorithm works as well as MADDPG. On the other hand, within the partially observable environment, recurrent actor (with fully connected critic) does not provide any better results than MADDPG; though, applying both recurrent actor and recurrent critic, R-MADDPG obtains higher rewards than MADDPG.

Since MADDPG concatenates all the local observations in the critic, it faces the curse of dimensionality with increasing the number of agents. To address this issue, [iqbal2018actor] proposed Multiple Actor Attention-Critic (MAAC) algorithm which efficiently scales up with the number of agents. The main idea in this work is to use an attention mechanism [choi2017multi, jiang2018learning] to select relevant information for each agent during the training. In particular, agent $i$ receives the observations, $o=({o}_{1},\mathrm{\dots},{o}_{N})$, and actions, $a=({a}_{1},\mathrm{\dots},{a}_{N})$ from all agents. The value function parameterized by $\psi $, ${Q}_{i}^{\psi}(o,a)$, is defined as a function of agent $i$’s observation-action, as well as the information received from the other agents:

${Q}_{i}^{\psi}(o,a)={f}_{i}({g}_{i}({o}_{i},{a}_{i}),{x}_{i}),$ |

where ${f}_{i}$ is a two-layer perceptron, ${g}_{i}$ is a one-layer embedding function, and ${x}_{i}$ is the contribution of other agents. In order to fix the size of ${x}_{i}$, it is set equal to the weighted sum of other agents’ observation-action:

${x}_{i}={\displaystyle \sum _{j\ne i}}{\alpha}_{j}{v}_{j}={\displaystyle \sum _{j\ne i}}{\alpha}_{j}h(V{g}_{j}({o}_{j},{a}_{j})),$ |

where ${v}_{j}$ is a function of the embedding of agent $j$, encoded with an embedding function and then linearly transformed by a shared matrix $V$, and $h$ is the activation function. Denoting ${e}_{j}={g}_{j}({o}_{j},{a}_{j})$, using the query-key system [vaswani2017attention], the attention weight ${\alpha}_{j}$ is proportional to:

${\alpha}_{j}\propto \mathrm{exp}({e}_{j}^{T}{W}_{k}^{T}{W}_{q}{e}_{i}),$ |

where ${W}_{q}$ transforms ${e}_{i}$ into a “query” and ${W}_{k}$ transforms ${e}_{j}$ into a “key”. The critic step updates the $\psi $ through minimizing the following loss function:

${\mathcal{L}}_{Q}(\psi )={\displaystyle \sum _{i=1}^{N}}{\mathbb{E}}_{(o,a,r,{o}^{\prime})\sim D}\left[{({Q}_{i}^{\psi}(o,a)-{y}_{i})}^{2}\right],$ |

where,

${y}_{i}={r}_{i}+\gamma {\mathbb{E}}_{{a}^{\prime}\sim {\pi}_{\overline{\theta}}({o}^{\prime})}\left[{Q}_{i}^{\overline{\psi}}({o}^{\prime},{a}^{\prime})-\alpha \mathrm{log}({\pi}_{\overline{{\theta}_{i}}}({a}_{i}^{\prime}|{o}_{i}^{\prime}))\right],$ |

where $\overline{\psi}$ and $\overline{\theta}$ are the parameters of the target critics and target policies respectively. In order to encourage exploration, they also use the idea of soft-actor-critic (SAC) [haarnoja2018soft]. MAAC is compared with COMA [foerster2018counterfactual] (will be discussed shortly), MADDPG, and their updated version with SAC, as well as an independent learner with DDPG [lillicrap2015continuous], over two environments: treasure collection and rover-tower. MAAC obtains better result than the other algorithms such that the performance gap becomes shorter as the number of agents increases.

[jiang2018graph] assumes a graph connection among the agents such that each node is an agent. A partially observable environment is assumed in which each agent observes a local observation, takes a local action and receive a local reward, while agents can share their observations with their neighbors and the weights of all agents are shared. A graph convolutional reinforcement learning for cooperative multi-agent is proposed. The multi-agent system is modeled as a graph such that agents are the nodes, and each has some features which are the encoded local observations. A multi-head attention model is used as the convolution kernel to obtain the connection weights to the neighbor nodes. To learn Q-function an end-to-end algorithm, named DGN, is proposed, which uses the centralized training and distributed execution approach. The goal is to maximize the sum of the reward of all agents. During the training, DGN allows the gradients of one agent flow to K-neighbor agents and its receptive field to stimulate cooperation. In particular, DGN consists of three phases: (i) an observation encoder, (ii) convolutional layer, and (iii) Q-network. The encoder (which is a simple MLP, or convolution layer if it deals with images) receives observation ${o}_{i}^{t}$ of agent $i$ at time $t$ and encodes it to a feature vector ${h}_{i}^{t}$. In phase (ii), the convolutional layer integrates the local observations of $K$ neighbors to generate a latent feature $h^{\prime}{}_{i}{}^{t}$. In order to obtain the latent vector, an attention model is used to make the input independent of the number of input features. The attention model gets all feature vectors of the $K$-neighbors, generates the attention weights, and then calculates the weighted sum of the feature vectors to obtain $h^{\prime}{}_{i}{}^{t}$. Another convolution layer may be added to the model to increase the receptive field of each agent, such that $h^{\prime}{}_{i}{}^{t}$ are the inputs of that layer and $h^{{}^{\prime \prime}}{}_{i}{}^{t}$ are the outcomes. Finally, in phase (iii), the Q-network provides the Q-value of each possible action. Based on the idea of DenseNet [huang2017densely], the Q-network gathers observations and all the latent features and concatenates them for the input. (Note that this procedure is followed for each agent, and the weights of the network are shared among all agents.) In the loss function, besides the Q-value loss function, a penalized KL-divergence is added. This function measures the changes between the current attention weights and the next state attention weights and tries to not allow drastic changes in the attention weight. Using the trained network, in the execution time each agent $i$ observes ${o}_{i}^{t}$ plus the observation of its $K$ neighbors (${\{{o}_{tot}\}}_{j\in \mathbb{N}(i)}^{t}$) to get an action. The results of their algorithm are compared with DQN, CommNet [sukhbaatar2016learning], and MeanField Q-Learning [yang2018mean], on Jungle, Battle, and Routing environments.

[yang2018mean] considers the multi-agent RL problems in the case that there exists a huge number of agents collaborating or competing with each other to optimize some specific long-term rewards. They propose Mean Field Reinforcement Learning framework, where every single agent only considers an average effect of its neighborhoods, instead of exhaustive communication with all other agents within the population. Two algorithms namely Mean Field Q-learning (MF-Q) and Mean Field Actor Critic (MF-AC) are developed following the mean-field idea. There exists a single state visible to all agents, and the local reward and the local action of all agents are also visible to the others during the training. Applying Tylor’ theorem, it is proved that ${Q}^{i}(s,a)$ can be approximated by ${Q}^{i}(s,{a}^{i},{\overline{a}}^{i})$, where $a$ concatenates the action of all agents, ${a}^{i}$ is the action of agent $i$, and ${\overline{a}}^{i}$ denotes the average action from the neighbors. Furthermore, utilizing the contraction mapping technique, it is shown that mean-field Q-values converge to the Nash Q-values under some particular assumptions. The proposed algorithms are tested on three different problems: Gaussian Squeeze and the Ising Model (a framework in statistical mechanics to mathematically model ferromagnetism), which are cooperative problems, and the battle game, which is a mixed cooperative-competitive game. The numerical results show the effectiveness of the proposed method in the case of many-agent RL problems.

[foerster2018counterfactual] proposes COMA, a model with a single centralized critic which uses the global state, the vector of all actions, and a joint reward. This critic is shared among all agents, while the actor is trained locally for each agent with the local observation-action history. The joint reward is used to train $Q({s}^{t},[{a}_{i}^{t},{a}_{-i}^{t}])$. Then, for the agent $i$, with the fixed ${a}_{-i}^{t}$ the actor uses a counterfactual baseline

$$b({s}^{t},{a}_{-i}^{t})=\sum _{{\widehat{a}}_{i}^{t}}{\pi}_{i}({\widehat{a}}_{i}^{t}|{o}_{i}^{t})Q({s}^{t},[{\widehat{a}}_{i}^{t},{a}_{-i}^{t}])$$ |

to obtain contribution of action ${a}_{i}^{t}$ via advantage function $A(s,{a}_{i}^{t})=Q(s,[{a}_{-i}^{t},{a}_{i}])-b({s}^{t},{a}_{-i}^{t})$. Also, each actor shares their weights with other agents, and uses gated recurrent unit (GRU) [cho2014learning] to utilize the history of the observation. They present the results of their algorithm on StarCraft game [samvelyan19smac] and compare COMA with central-V, central-QV, and two implementations of independent actor-critic (IAC).

[yang2018cm3] considers a multi-agent cooperative problem in which in addition to the cooperative goal, each agent needs to attain some personal goals. Each agent observes a local observation, local reward corresponding to the goal, and its own history of actions, while the actions are executed jointly in the environment. This is a common situation in problems like autonomous car driving. For example, each car has to reach a given destination and all cars need to avoid the accident and cooperate in junctions. The authors propose an algorithm (centralized training, decentralized execution) called CM3, with two phases. In the first phase, one single network is trained for all agents to learn personal goals. The output of this network, a hidden layer, is passed to a given layer of the second network to initialize it, and the goal of the second phase is to attain the global goal. Also, since the collective optimal solution of all agents is not necessarily optimal for every individual agent, a credit assignment approach is proposed to obtain the global solution of all agents. This credit assignment, motivated by [foerster2018counterfactual], is embedded in the design of the second phase. In the first phase of the algorithm, each agent is trained with an actor-critic algorithm as a single agent problem learning to achieve the given goal of the agent. In this order, the agent is trained with some randomly assigned goal very well, i.e., agent $i$ wants to learn the local policy $\pi $ to maximize ${J}_{i}(\pi )$ for any arbitrary given goal. All agents share the weights of the policy, so the model is reduced to maximize ${J}_{local}(\pi )={\sum}_{i=1}^{N}{J}_{i}(\pi )$. Using the advantage approximation, the update is performed by ${\nabla}_{\theta}{J}_{local}(\pi )=\mathbb{E}\left[{\sum}_{i=0}^{N}{\nabla}_{\theta}\mathrm{log}{\pi}_{i}({a}_{i}|{o}_{i},{g}_{i})(R(s,{a}_{i},{g}_{i})+\gamma V({o}_{i}^{t+1},{g}_{i})-V({o}_{i}^{t},{g}_{i}))\right]$, in which ${g}_{i}$ is the goal of agent $i$. The second phase starts by the pre-trained agents, and trains a new global network in the multi-agent setting to achieve a cooperative goal by a comprehensive exploration. The cooperative goal is the sum of local rewards, i.e., ${J}_{global}(\pi )={\mathbb{E}}_{\pi}\left[{\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{R}_{g}^{t}\right]$, in which ${R}_{g}^{t}={\sum}_{i=1}^{N}r({o}_{i}^{t},{a}_{i}^{t},{g}_{i})$. In the first phase, each agent only observes the part of the state that involves the required observation to complete the personal task. In the second phase, additional observations are given to each agent to achieve the cooperative goal. This new information is used to train the centralized critic and also used in the advantage function to update the actor’s policy. The advantage function uses the counterfactual baseline [foerster2018counterfactual], so that the global objective is updated by ${\nabla}_{\theta}{J}_{global}(\pi )=\mathbb{E}\left[{\sum}_{i=0}^{N}{\nabla}_{\theta}\mathrm{log}{\pi}_{i}({a}_{i}|{o}_{i},{g}_{i})(Q(s,a,g)-b(s,{a}^{-i},g))\right]$. Finally, a combined version of the local and global models are used in this phase to train the model with the centralized critic. They present the experiments on an autonomous vehicle negotiation problem, and compare the results with COMA [foerster2018counterfactual] and independent actor-critic learners model.

[sartoretti2019distributed] extends A3C [mnih2016asynchronous] algorithm for a centralized actor and a centralized critic. The proposed algorithm is based on centralized training and decentralized execution, in a fully observable environment. They consider a construction problem, TERMES [petersen2012termes], in which each agent is responsible to gather, carry, and place some blocks to build a certain structure. Each agent observes the global state plus its own location, takes its own local action and executes it locally (no joint action selection/execution). Each agent receives its local sparse reward, such that the reward is +1 if it puts down a block in a correct position and -1 if it picks up a block from a correct position. Any other actions receive 0 reward. During the training, all agents share the weights (as it is done in A3C) for both actor and critic models and train a central agent asynchronously. In the execution time, each agent uses one copy of the learned policy without communicating with other agents. The goal of all agents is to achieve the maximum common reward. The learned policy can be executed in an arbitrary number of agents, and each agent can see the other agents as moving elements, i.e., as part of the environment.

Finally, in [kim2019learning] a MARL problem is considered under the following two assumptions: 1) The communication bandwidth among the agents is limited. 2) There exists a shared communication medium such that at each time step only a subset of agents is able to use it in order to broadcast their messages to the other agents. Therefore, communication scheduling is required to determine which agents are able to broadcast their messages. Utilizing the proposed framework, which is called SchedNet, the agents are able to schedule themselves, learn how to encode the received messages, and also learn how to pick actions based on these encoded messages. SchedNet focuses on centralized training and distributed execution. Therefore, in training the global state is available to the critic, while the actor is local for each agent and the agents are able to communicate through the limited channel. To control the communication, a Medium Access Control (MAC) protocol is proposed, which uses a weight-based scheduler (WSA) to determine which nodes can access to the shared medium. The local actor $i$ contains three networks: 1) message encoder, which takes the local observation ${o}_{i}$ and outputs the messages ${m}_{i}$. 2) weight generator, which takes the local observation ${o}_{i}$ and outputs the weight ${w}_{i}$. Specifically, the ${w}_{i}$ determines the importance of observations in node $i$. 3) action selector. This network receives the observation ${o}_{i}$, encoded messages ${m}_{i}$ plus the information from the scheduling module, which selects $K$ agents who are able to broadcast their messages. Then maps this information to the action ${u}_{i}$. A centralized critic is used during the training to criticize the actor. In particular, the critic receives the global state of the environment and the weigh vector $W$ generated by the wight generator networks as an input, and the output will be $Q(S,W)$ as well as $V(S)$. The first one is used to update the actor wight ${w}_{i}$ while the second one is used for adjusting the weights of two other networks, namely action selector and message encoder. Experimental results on the predator-prey, cooperative-communication,and navigation task demonstrate that intelligent communication scheduling can be helpful in MARL.

## 4 Value Function Factorization

Consider a cooperative MARL problem in which we are allowed to share all information among the agents and there is no communication limitation among the agents. Further, let’s assume that we are able to deal with the huge action space. In this scenario, a centralized RL approach can be used to solve the problem, i.e., all state observations are merged together and the problem is reduced to a single agent problem with a combinatorial action space. However, [sunehag2018value] shows that naive centralized RL methods fail to find the global optimum, even if we are able to solve the problems with such huge state and action space. The issue comes from the fact that some of the agents may get lazy and not learn and cooperate as they are supposed to. This may lead to failure of the whole system. One possible approach to address this issue is to determine the role of each agent in the joint reward and then somehow isolate its share out of it. This category of algorithms is called Value Function Factorization.

In POMDP settings, if the optimal reward shaping is available, the problem reduces to train several independent learners, which simplifies the learning. Therefore, having a reward shaping model would be appealing for any cooperative MARL. Although, it is not easy to divide the received reward among the agents based on their contribution. Following this idea, the rest of this section discusses the corresponding algorithms.

In the literature of tabular RL, there are two common approaches for reward shaping: (i) *difference rewards* [agogino2004unifying] which tries to isolate the reward of each agent from the joint reward, i.e. ${\widehat{r}}_{i}=r-{r}_{-i}$, where ${r}_{-i}$ denotes the other agents’ share than agent $i$ in the global reward,
(ii) * Potential-based reward shaping* [ng1999policy]. In this class of value function factorization methods, term $r+\mathrm{\Phi}({s}^{\prime})-\mathrm{\Phi}(s)$ is used instead of mere $r$, in which $\mathrm{\Phi}(s)$ defines the desirability of the agent to be at state $s$. This approach is also extended for online POMDP settings [eck2016potential] and multi-agent setting [devlin2011theoretical], though defining the potential function is challenging and usually needs specific domain knowledge. In order to address this issue, [devlin2014potential] combines these approaches and proposes two tabular algorithms.
Following this idea, several value function factorization algorithms are proposed to automate reward shaping and avoid the need for a field expert, which are summarized in the following.

[sunehag2018value] considers a fully cooperative MARL (so a single shared reward exists) in which each agent observes its own state and action history. An algorithm, called VDN, is proposed to decompose the value function for each agent. Intuitively, VDN measures the impact of each agent on the observed joint reward. It is assumed that the joint action-value function ${Q}_{tot}$ can be additively decomposed into $N$ Q-functions for $N$ agents, in which each Q-function only relies on the local state-action history, i.e.,

$${Q}_{tot}=\sum _{i=1}^{N}{Q}_{i}({\tau}_{i},{u}_{i},{\theta}_{i})$$ | (14) |

In other words, for the joint observation-action history $\bm{\tau}$, it assumes validity of the individual-global-max (IGM) condition. Individual action-value functions ${[{Q}_{i}:\mathcal{T}\times \mathcal{U}]}_{i=1}^{N}$ satisfies IGM condition for the joint action-value function ${Q}_{tot}$, if:

$$\underset{\bm{u}}{\mathrm{arg}\mathrm{max}}{Q}_{tot}(\bm{\tau},\bm{u})=\left(\begin{array}{c}\hfill {\mathrm{arg}\mathrm{max}}_{{u}_{1}}{Q}_{1}({\tau}_{1},{u}_{1})\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill {\mathrm{arg}\mathrm{max}}_{{u}_{N}}{Q}_{N}({\tau}_{N},{u}_{N})\hfill \end{array}\right),$$ | (15) |

in which $\bm{\tau}$ is the vector of local observation of all agents, and $u$ is the vector of actions for all agents. Therefore, each agent observes its local state, obtains the Q-values for its action, selects an action, and then the sum of Q-values for the selected action of all agents provides the total Q-value of the problem. Using the shared reward and the total Q-value, the loss is calculated and then the gradients are backpropagated into the networks of all agents. In the numerical experiments, a recurrent neural network with dueling architecture [wang2015dueling] is used to train the model. Also, two extensions of the model are analyzed: (i) shared the policy among the agent, by adding the one-hot-code of the agent id to state input, (ii) adding information channels to share some information among the agents. Finally, VDN is compared with independent learners, and centralized training, in three versions of the two-player 2D grid.

QMIX [rashid2018qmix] considers the same problem as VDN does, and proposed an algorithm which is basically an improvement over VDN [sunehag2018value]. As mentioned, VDN adds some restrictions to have the additivity of the Q-value and further shares the action-value function during the training. QMIX also shares the action-value function during the training (a centralized training algorithm, decentralized execution); however, adds the below constraint to the problem:

$$\frac{\partial {Q}_{tot}}{\partial {Q}_{i}}\ge 0,\forall i,$$ | (16) |

which enforces positive weights on the mixer network, and as a result, it can guarantee (approximately) monotonic improvement. Particularly, in this model, each agent has a ${Q}_{i}$ network and they are part of the general network (${Q}_{tot}$) that provides the Q-value of the whole game. Each ${Q}_{i}$ has the same structure as DRQN [hausknecht2015deep], so it is trained using the same loss function as DQN. Besides the monotonicity constraint over the relationship between ${Q}_{tot}$ and each ${Q}_{i}$, QMIX adds some extra information from the global state plus a non-linearity into ${Q}_{tot}$ to improve the solution quality. They provide numerical results on StarCraft II and compare the solution with VDN.

Even though VDN and QMIX cover a large domain of MARL problems, the existing restrictions for these two do not hold for all problems. To the end of relaxing these restrictions, [son2019qtran] proposes QTRAN algorithm. The general settings are the same as VDN and QMIX (i.e., general DEC-POMDP problems in which each agent has its own partial observation, action history, and all agents share a joint reward). The key idea here is that the actual ${Q}_{tot}$ may be different than ${\sum}_{i=1}^{N}{Q}_{i}({\tau}_{i},{u}_{i},{\theta}_{i})$. However, they consider an alternative joint action-value ${Q}_{tot}^{{}^{\prime}}$, assumed to be factorizable by additive decomposition. Then, to fill the possible gap between ${Q}_{tot}$ and ${Q}_{tot}^{{}^{\prime}}$ they introduce

$${V}_{tot}=\underset{\bm{u}}{\mathrm{max}}{Q}_{tot}(\bm{\tau},\bm{u})-\sum _{i=1}^{N}{Q}_{i}({\tau}_{i},{\overline{u}}_{i}),$$ | (17) |

in which ${\overline{u}}_{i}$ is ${\mathrm{arg}\mathrm{max}}_{{u}_{i}^{{}^{\prime}}}{Q}_{i}({\tau}_{i},{u}_{i}^{{}^{\prime}})$. Given, $\overline{\bm{u}}={[\overline{{u}_{i}}]}_{i=1}^{N}$, they prove that

$$\sum _{i=1}^{N}{Q}_{i}({\tau}_{i},{u}_{i},{\theta}_{i})-{Q}_{tot}(\bm{\tau},\bm{u})+{V}_{tot}(\bm{\tau},\bm{u})=\{\begin{array}{cc}0\hfill & \bm{u}=\overline{\bm{u}}\hfill \\ \ge 0\hfill & \bm{u}\ne \overline{\bm{u}}\hfill \end{array}$$ | (18) |

Based on this theory, three networks are built: individual ${Q}_{i}$, ${Q}_{tot}$, and the joint regularizer ${V}_{tot}$ and three loss functions are demonstrated to train the networks. The local network at each agent is just a regular value-based network, with the local observation which provides the Q-value of all possible actions and runs locally at the execution time. Both ${Q}_{tot}$ and the regularizer networks use hidden features from the individual value-based network to help sample efficiency. In the experimental analysis, the comparisons of QTRAN with VDN and QMIX on Multi-domain Gaussian Squeeze [holmesparker2014exploiting] and modified predator-prey [stone2000multiagent] is provided.

Within the cooperative setting with the absence of joint reward, the reward-shaping idea can be applied too. Specifically, assume at time step $t$ agent $i$ observes its own local reward ${r}_{i}^{t}$. In this setting, [mguni2018inducing] considers a MARL problem in which each agent observes the full state, takes its local action based on a stochastic policy. A general reward-shaping algorithm for the multi-agent problem is discussed and proof for obtaining the Nash equilibrium is provided. In particular, a meta-agent (MA) is introduced to modify the agent’s reward functions to get the convergence to an efficient Nash equilibrium solution. The MA initially does not know the parametric reward modifier and learns it through the training. Basically, MA wants to find the optimal variables $w$ to reshape the reward functions of each agent, though it only observes the corresponding reward of chosen $w$. With a given $w$, the MARL problem can convergence while the agents do not know anything about the MA function. The agents only observe the assigned reward by MA and use it to optimize their own policy. Once all agents execute their actions and receive the reward, the MA receives the feedback and updates the weight $w$. Training the MA with a gradient based algorithm is quite expensive, so in the numerical experiments, a Bayesian optimization with an expected improvement acquisition function is used. To train the agents, an actor-critic algorithm is used with a two-layer neural network as the value network. The value network shares its parameters with the actor network, and an A2C algorithm [a2c] is used to train the actor. It is proved that under a given condition, a reward modifier function exists such that it maximizes the expectation of the reward modifier function. In other words, a Markov-Nash equilibrium (M-NE) exists in which each agent follows a policy that provides the highest possible value for each agent. Then, convergence to the optimal solution is proved under certain conditions. To demonstrate the performance of their algorithm, a problem with 2000 agents is considered in which the desire location of agents changes through time.

## 5 Consensus

The idea of the centralized critic, which is discussed in Section 3, works well when there exists a small number of agents in the communication network. However, with increasing the number of agents, the volume of the information might overwhelm the capacity of a single unit. Moreover, in the sensor-network applications, in which the information is observed across a large number of scattered centers, collecting all this local information to a centralized unit, under some limitations such as energy limitation, privacy constraints, geographical limitations, and hardware failures, is often a formidable task. One idea to deal with this problem is to remove the central unit and allow the agents to communicate through a sparse network, and share the information with only a subset of agents, in order to reach a consensus over a variable with these agents (called neighbors). Besides the numerous real-world applications of this setting, this is quite a fair assumption in the applications of MARL [zhang2018fully, jiang2018graph]. By limiting the number of neighbors to communicate, the amount of communication remains linear in the order of the number of neighbors. In this way, each agent uses only its local observations, though uses some shared information from the neighbors to stay tuned with the network. Further, applying the consensus idea, there exist several works which prove the convergence of the proposed algorithms when the linear approximators are utilized. In the following, we review some of the leading and most recent papers in this area.

[varshavskaya2009efficient] studies a problem in which each agent has a local observation, executes its own local policy and receives a local reward. A tabular policy optimization agreement algorithm is proposed which uses Boltzmann’s law (similar to soft-max function) to solve this problem. The agreement (consensus) algorithm assumes that an agent can send its local reward, a counter on the observation, and the taken action per observation to its neighbors. The goal of the algorithm is maximizing weighted average of the local rewards. In this way, they guarantee that each agent learns as much as a central learner could, and therefore converges to a local optimum.

In [kar2013distributed, kar2013cal] the authors proposed a decentralized multi-agent version of the tabular Q-learning algorithm called $\mathcal{Q}\mathcal{D}$-learning. In this paper, the global reward is expressed as the sum of the local rewards, though each agent is only aware of its own local reward. In the problem setup, the authors assume that the agents communicate through a time-invariant un-directed weakly connected network, to share their observations with their direct neighbors. All agents observe the global state and the global action, and the goal is to optimize the network-averaged infinite horizon discounted reward. The $\mathcal{Q}\mathcal{D}$-learning works as follows. Assume that we have $N$ agents in the network and agent $i$ can communicate with its neighbors ${\mathcal{N}}_{i}$. This agent stores ${Q}_{i}(s,a)$ values for all possible state-action pairs. Each update for the agent $i$ at time $t$ includes the regular Q-learning plus the deviation of the Q-value from its neighbor as below:

${Q}_{i}^{t+1}(s,a)={Q}_{i}^{t}(s,a)$ | $+{\alpha}_{s,a}^{t}(r{}_{i}({s}_{t},{a}_{t})+\gamma \underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{min}}Q{}_{i}{}^{t}({s}_{t+1},{a}^{\prime})-Q{}_{i}{}^{t}(s,a))$ | |||

$-{\beta}_{s,a}^{t}{\displaystyle \sum _{j\in {\mathcal{N}}_{i}^{t}}}\left({Q}_{i}^{t}(s,a)-{Q}_{j}^{t}(s,a)\right),$ | (19) |

where $\mathcal{A}$ is the set of all possible actions, and ${\alpha}_{s,a}^{t}$ and ${\beta}_{s,a}^{t}$ are the stepsizes of the $\mathcal{Q}\mathcal{D}$-learning algorithm. It is proved that this method converges to the optimal Q-values in an asymptotic behavior under some specific conditions on the step-sizes. In other words, under the given condition, they prove that their algorithm obtains the result that the agent could achieve if the problem was solved centrally.

In [DAC_Pennesi_2010] a distributed Actor-Critic (D-AC) algorithm is proposed under the assumption that the states, actions, and rewards are local for each agent; however, each agent’s action does not change the other agents’ transition models. The critic step is performed locally, meaning that each agent evaluates its own policy using the local reward receives from the environment. In particular, the state-action value function is parameterized using a linear function and the parameters are updated in each agent locally using Temporal Difference algorithm together with the eligibility traces. The Actor step on the other hand is conducted using information exchange among the agents. First, the gradient of the average reward is calculated. Then a gradient step is performed to improve the local copy of the policy parameter along with a consensus step. A convergence analysis is provided, under diminishing step-sizes, showing that the gradient of the average reward function tends to zero for every agent as the number of iterations goes to infinity. In this paper, a sensor network problem with multiple mobile nodes has been considered for testing the proposed algorithm. In particular, there are $M$ target points and $N$ mobile sensor nodes. Whenever one node visits a target point a reward will be collected. The ultimate goal is to train the moving nodes in the grid such that the long-term reward becomes maximized. They have been considered a $20\times 20$ grid with three target points and sixteen agents. The numerical results prove that the reward improves over time while the policy parameters reach consensus.

[macua2017diff] proposes a new algorithm, called Diffusion-based Distributed Actor-Critic (Diff-DAC) for single and multi-task multi-agent RL. In the setting, there are $N$ agents in the network such that there is one path between any two agents, each is assigned either the same task or a different task than the others, and the goal is to maximize the weighted sum of the value functions overall tasks. Each agent runs its own instance of the environment with a specific task, without intervening the other agents. For example, each agent runs a given cart-pole problem where the pole length and its mass are different for different agents. Basically, one agent does not need any information, like state, action, or reward of the other agents. Agent $i$ learns the policy with parameters ${\theta}_{i}$, while it tries to reach consensus with its neighbors using diffusion strategy. In particular, the Diff-DAC trains multiple agents in parallel with different and/or similar tasks to reach a single policy that performs well in average for all tasks, meaning that the single policy might obtain a high reward for some tasks but performs poorly for the others. The problem formulation is based on the average reward, average value-function, and average probability transition. Based on that, they provide a linear programming formulation of the tabular problem and provide its Lagrangian relaxation and the duality condition to have a saddle point. A dual-ascent approach is used to find a saddle point, in which (i) a primal solution is found for a given dual variable by solving an LP problem, and then (ii) a gradient ascend is performed in the direction of the dual variables. These steps are performed iteratively to obtain the optimal solution. Next, the authors propose a practical algorithm utilizing a DNN function approximator. During the training of this algorithm, agent $i$ first performs an update over the weights of the critic as well as actor networks using local information. Then, a weighted average is taken over the weights of both networks which assures these networks reach consensus. The algorithm is compared with a centralized training for the Cart-Pole game.

In [zhang2018fully] a MARL problem is considered with the following setup. There exists a common environment for all agents and the global state $s$ is available to all of them, each agent takes its own local action ${a}_{i}$, and the global action $a=[{a}_{1},{a}_{2},\mathrm{\cdots},{a}_{N}]$ is available to all $N$ agents, each agent receives ${r}_{i}$ reward after taking an action and this reward is visible only in agent $i$. In this setup, the agents can do time-varying randomly communication with their direct neighbors ($\mathcal{N}$) to share some information. Two AC-based algorithms are proposed to solve this problem. In the first algorithm, each agent has its own local approximation of the Q-function with weights ${w}_{i}$, though a fair approximation needs global reward ${r}_{t}$ (not local rewards ${r}_{t}^{i},\forall i=1,\mathrm{\dots},N$). To address this issue, it is assumed that each agent shares parameter ${w}_{i}$ with its neighbors and in this way a consensual estimate of ${Q}_{w}$ can be achieved. To update the critic, the temporal difference is estimated by

$${\delta}_{i}^{t}={r}_{i}^{t+1}-{\mu}_{i}^{t}+Q({s}_{t+1},{a}_{t+1},{w}_{i}^{t})-Q({s}_{t},{a}_{t},,{w}_{i}^{t}),$$ |

in which

$${\mu}_{i}^{t}=(1-{\beta}_{wt}^{t}){\mu}_{i}^{t}+{\beta}_{wt}{r}_{i}^{t+1},$$ | (20) |

i.e. the moving average of the agent $i$ rewards with parameter ${\beta}_{it}$, and the new weights ${w}_{i}$ are achieved locally. To achieve the consensus a weighted sum (with weights coming from the consensus matrix) of the parameters of the neighbors’ critics are calculated as below:

$${w}_{i}^{t+1}=\sum _{j\in {\mathcal{N}}_{i}}{c}_{ij}{\stackrel{~}{w}}_{j}^{t}.$$ | (21) |

This weighted sum provides the new weights of the critic $i$ for the next time step. To update the actor, each agent observes the global state and the local action to update its policy; though during the training, the advantage function requires actions of all agents, as mentioned earlier. In the critic update, the agents do not share any rewards info and neither actor policy; so, in some sense, the agents keep the privacy of their data and policy. However, they share the actions with all agents, so that the setting of the problem is pretty similar to MADDPG algorithm; although, MADDPG assumes the local observation in the actor. In the second algorithm, besides sharing the critic weights, the critic observes the moving average estimate for rewards of the neighbor agents and uses that to obtain a consensual estimate of the reward. Therefore, this algorithm performs the following update instead of (20):

$${\stackrel{~}{\mu}}_{i}^{t}=(1-{\beta}_{wt}^{t}){\mu}_{i}^{t}+{\beta}_{wt}{r}_{i}^{t+1},$$ |

in which ${\stackrel{~}{\mu}}_{i}^{t}={\sum}_{j\in {\mathcal{N}}_{i}}{c}_{ij}{\stackrel{~}{\mu}}_{j}^{t}.$ Note that in the second algorithm agents share more information among their neighbors. From the theoretical perspective, The authors provide a global convergence proof for both algorithms in the case of linear function approximation. In the numerical results, they provide the results on two examples: (i) a problem with 20 agents and |S|=20, (ii) a completely modified version of cooperative navigation [lowe2017multi] with 10 agents and $|S|=40$, such that each agent observes the full state and they added a given target landmark to cover for each agent; so agents try to get closer to the certain landmark. They compare the results of two algorithms with the case that there is a single actor-critic model, observing the rewards of all agents and the centralized controller is updated there. In the first problem, their algorithms converged to the same ’return value’ that the centralized algorithms achieve. In the second problem, it used a neural network and with that non-linear approximation and their algorithms got a small gap compared to the solutions of the centralized version.

[zhang2018networked] considers the MARL problem with continuous state and action space. The rest of the setting is similar to the [zhang2018fully] (i.e., global state, global action, and local reward). Again, an AC-based algorithm is proposed for the considered problem. In general, for the continuous spaces, stochastic policies lead to a high variance in gradient estimation. Therefore, to deal with this issue deterministic policy gradient (DPG) algorithm is proposed in [silver2014deterministic] which requires off-policy exploration. However, in the setting of [zhang2018networked] the off-policy information of each agent is not known to other agents, so the approach used in DPG [silver2014deterministic, lillicrap2015continuous] cannot be applied here. Instead, a gradient update based on the expected policy-gradient (EPG) [ciosek2018expected] is proposed, which uses a global estimate of Q-value, approximated by the consensus update. Thus, each agent shares parameters ${w}_{i}$ of each Q-value estimator with its neighbors. Given the mentioned assumptions, the convergence guarantees with a linear approximator are provided and the performance is compared with a centrally trained algorithm for the same problem.

Following similar setting as [zhang2018fully], [suttle2019multi] proposes a new distributed off-policy actor-critic algorithm, such that there exists a global state visible to all agents, each agent takes an action which is visible to the whole network, and receives a local reward which is available only locally. The main difference between this work and [zhang2018fully] comes from the fact that the critic step is conducted in an off-policy setting using emphatic temporal differences $ETD(\lambda )$ policy evaluation method [sutton2016emphatic]. In particular, $ETD(\lambda )$ uses state-dependent discount factor ($\gamma $) and state-dependent bootstrapping parameter ($\lambda $). Besides, in this method there exists an interest function $f:\mathcal{S}\to {\mathbb{R}}^{+}$ that takes into account the user’s interest in specific states. The algorithm steps are as the following: First, each agent performs a consensus step over the critic parameter. Since the behavior policy is different than the target policy for each agent, they apply importance sampling [kroese2012monte] to re-weight the samples from the behavior policy in order to correspond them to the target policy. Then, an inner loop starts to perform another consensus step over the importance sampling ratio. In the next step, a critic update using $ETD(\lambda )$ algorithm is performed locally and the updated weights are broadcast over the network. Finally, each agent performs the actor update using local gradient information for the actor parameter. Following the analysis provided for $ETD(\lambda )$ in [yu2015convergence], the authors proved the convergence of the proposed method for the distributed actor-critic method when linear function approximation is utilized.

[zhang2017datadriven] proposes a consensus RL algorithm, in which each agent uses its local observations as well as its neighbors within a given directed graph. The multi-agent problem is modeled as a control problem, and a consensus error is introduced. The control policy is supposed to minimize the consensus error while stabilizes the system and gets the finite local cost. A theoretical bound for the consensus error is provided, and the theoretical solution for having the optimal policy is discussed, which indeed needs environment dynamics. A practical actor-critic algorithm is proposed to implement the proposed algorithm. The practical version involves neural network approximator via a linear activation function. The critic measures the local cost of each agent, and the actor network approximates the control policy. The results of their algorithm on leader-tracking communication problem are presented and compared with the known optimal solution.

In [distributed_macua_2015], an off-policy distributed policy evaluation algorithm is proposed. In this paper, a linear function has been used to approximate the long-term reward of a given policy (target policy), which is assumed to be the same for all agents, while different agents follow different policies along the way. In particular, a distributed variant of the Gradient Temporal Difference (GTD) algorithm ^{1}^{1}
1
GTD algorithm is proposed to stabilize the TD algorithm with linear function approximation in an off-policy setting. [GTD2_Sutton_2009] is developed utilizing a primal-dual optimization scenario. In order to deal with the off-policy setting, they have applied the importance sampling technique. The state space, action space, and transition probabilities are the same for every node, but their actions do not influence each other. This assumption makes the problem stationary. Therefore, the agents do not need to know the state and the action of the other agents. Regarding the reward, it is assumed that there exists only one global reward in the problem. First, they showed that GTD algorithm is a stochastic Arrow-Hurwicz ^{2}^{2}
2
Arrow-Hurwicz is a primal-dual optimization algorithm which performs the gradient step on the Lagrangian over the primal and dual variables iteratively [arrow_hurwicz] algorithm applied to the dual problem of the original optimization problem. Then, inspiring from [chen2012diffusion], they proposed a diffusion-based distributed GTD algorithm. Under sufficiently small but constant step-sizes, they provide a mean-square-error performance analysis which proves that the proposed algorithms converge to a unique solution. In order to evaluate the performance of the proposed method, a 2-D grid world problem with 15 agents is considered. Two different policies are evaluated using distributed GTD algorithm. It is shown that the diffusion strategy helps the agents to benefit from the other agents’ experiences.

Considering similar setup as [distributed_macua_2015], in [multiagent_stankovic_2016] two multi-agent policy evaluation algorithms were proposed over a time-varying communication network. A given policy is evaluated using the samples derived from different policies in different agents (i.e. off-policy). Same as [distributed_macua_2015], it is assumed that the actions of the agents do not interfere with each other. Weak convergence is provided for both algorithms.

Another variant of the distributed GTD algorithm was proposed in [PrimalDual_Lee_2018]. Each agent in the network is following a local policy ${\pi}_{i}$ and the goal is to evaluate the global long term reward which is the sum of the local rewards. In this work, it is assumed that each agent can observe the global joint state. A linear function which combines the features of the states is used as the estimator for the value function. The problem is modeled as a constrained optimization problem (consensus constraint), and then following the same process as [distributed_macua_2015], a primal-dual algorithm was proposed to solve it. A rigorous analysis based on the ordinary differential equation (ODE) [borkar_ode_2000] is provided for the proposed algorithm. Particularly, they showed that the set of KKT solutions of the Lagrangian dual problem can be compactly written by an ODE in the form of $\dot{x}=-Ax-b$. In order to keep the stability of the algorithm, they add some box constraints over the variables. Finally, under diminishing step-size, they prove that the distributed GTD (DGTD) converges with probability one. One of the numerical examples is considered a stock market problem where $N=5$ different agents have different policies for trading the stocks. DGTD is utilized to estimate the average of long term discounted profit of all agents. The results are compared with a single GTD algorithm in the case the sum of the reward is available. The comparison results show that each agent can successfully approximate the global value function.

[cassano2018multi] considers two different scenarios for the policy evaluation task: (i) each agent is following a policy (behavior policy) different than others, and the goal is to evaluate a target policy (i.e. off-policy). In this case, each agent has only knowledge about its own state and reward, which is independent of the other agents’ state and reward. (ii) The state is global and visible to all agents, the reward is local for each agent, and the goal is to evaluate the target team policy. They propose a Fast Diffusion for Policy Evaluation (FDPE) for the case with a finite data set, which combines off-policy learning, eligibility traces, and linear function approximation. This algorithm can be applied to both scenarios mentioned earlier. The main idea here is to apply variance-reduced algorithm called AVRG [ying2018convergence] over a finite data set to get a linear convergence rate. Further, they modified the cost function to control the bias term. In particular, they use $h$-stage Bellman Equation to derive $H$-truncated $\lambda $-weighted Mean Square Projected Bellman Error ($H\lambda $- MSPBE) compare to the usual cases (e.g. [distributed_macua_2015]) where they use Mean Square Projected Bellman Error (MSPBE). It is shown that the bias term can be controlled through $(H,\lambda )$. Also, they add a regularization term to the cost function, which can be useful in some cases.

A distributed off-policy actor-critic algorithm is proposed in [zhang_distributed_2019]. In contrast to the [zhang2018fully], where the actor step is performed locally and the consensus update is proposed for the critic, in [zhang_distributed_2019] the critic is performed locally, and the agents asymptotically achieve consensus on the actor parameter. The state space and action space are continuous and each agent has the local state and action; however, the global state and the global action are visible to all agents. Both policy function and value function are linearly parameterized. A convergence analysis is provided for the proposed algorithm under diminishing step-sizes for both actor and critic steps. The effectiveness of the proposed method was studied on the distributed resource allocation problem.

## 6 Learn to Communicate

As mentioned earlier in Section 5, some environments allow the communication of agents. The consensus algorithms use the communication bandwidth to pass raw observation, policy weight/gradients, critic weight/gradients, or some combination of them. A different approach to use the communication bandwidth is to learn a communication-action (like a message) to allow agents to be able to send information that they want. In this way, the agent can learn the time for sending a message, the type of message, and the destination agent. Usually, the communication-actions do not interfere with the environment, i.e., the messages do not affect the next state or reward. [kasai2008learning] proposed one of first learning to communicate algorithms, in which tabular Q-learning agents learn messages to communicate with other agents in the predator-prey environment. The same approach with a tabular RL is followed in [varshavskaya2009efficient]. Besides these early works, there are several recent papers in this area which utilize the function approximator. In this section, we discuss some of the more relevant papers in this research area.

In one of the most recent works, [foerster2016learning] consider a problem to learn how to communicate in a fully cooperative (recall that in a fully cooperative environment, agents share a global reward) multi-agent setting in which each agent accesses a local observation and has a limited bandwidth to communicate to other agents. Suppose that $\mathcal{M}$ and $\mathcal{U}$ denote message space and action space respectively. In each time-step, each agent takes action $u\in \mathcal{U}$ which affects the environment, and decides for action $m\in \mathcal{M}$ which does not affect the environment and only other agents observe it.
The proposed algorithm follows the centralized learning decentralized execution paradigm under which it is assumed in the training time agents do not have any restriction on the communication bandwidth.
They propose two main approaches to solve this problem. Both approaches use DRQN [hausknecht2015deep] to address partial observability, and disabled experience replay to deal with the non-stationarity.
The input of Q-network for agent $i$ at time $t$ includes ${o}_{i}^{t}$, ${h}_{i}^{t}$ (the hidden state of RNN), ${\{{u}_{j}^{t-1}\}}_{j}$, and ${\{{m}_{j}^{t-1}\}}_{j}$ for all $j\in \{1,\mathrm{\dots},N\}$. When parameter sharing is used, $i$ is also added in the input which helps learn specialized networks for agent $i$ within parameter sharing. All of the input values are converted into a vector of the same size either by a look-up table or an embedding (separate embedding of each input element), and the sum of these *same-size vectors* is the final input to the network. The network returns $|\mathcal{M}|+|\mathcal{U}|$ outputs for selecting actions $u$ and $m$. The network includes two layers of GRU, followed by two MLP layers, and the final layer with $|U|+|M|$ representing $|U|$ Q-values. $|M|$ is different on two algorithms and is explained in the following. First they propose *reinforced inter-agent learning* (RIAL) algorithm. To select the communication action, the network includes additional $|M|$ Q-values to select discrete action ${m}_{t}^{i}$. They also proposed a practical version of RIAL in which the agents share the policy parameters so that RIAL only needs to learn one network.
The second algorithm is *differentiable inter-agent learning* (DIAL), in which the message is continuous and the message receiver provides some feedback—in form of gradient—to the message sender, to minimize the DQN loss. In other words, the receiver obtains the gradient of its Q-value w.r.t the received message and sends it back to the sender so that the sender knows how to change the message to optimize the Q-value of the receiver.
Intuitively, agents are rewarded for the communication actions, if the receiver agent correctly interprets the message and acts upon that.
The network also creates a continues vector for the communication action so that there is no action selector for the communication action ${m}_{i}^{t}$, and instead, a regularizer unit discretize it, if necessary.
They provide numerical results on switch riddle prisoner game and three communication game with mnist dataset. The results are compared with no-communication and parameter sharing version of RIAL and DIAL methods.

[jorge2016learning] extends DIAL in three directions: (i) allow communication of arbitrary size, (ii) gradually increase noise on the communication channels to make sure that the agents learn a symbolic language, (iii) agents do not share parameters. They provide the results of their algorithm on a version of "Guess Who?" game, in which two agents, namely "asking" and "answering", participate. The game is around guessing the true image that the answering agent knows, while the asking agent has $n$ images and by asking $n/2$ questions should guess the correct image. The answering agent returns only "yes/no" answers, and after $n/2$ questions the asking agent guesses the target image. The results of their algorithm with different parameters is presented. Following a similar line, [lazaridou2016multi] considers a problem with two agents and one round of communication to learn an interpretable language among the sender and receiver agents. The sender receives two images, while it knows the target image, and sends a message to the receiver along with the images. If the receiver guesses the correct image, both win a reward. Thus, they need to learn to communicate through the message. Each individual agent converts the image to a vector using VGG ConvNet [simonyan2014very]. The sender builds a neural network on top of the input vector to select one of the available symbols (values $10$ and $100$ are used in the experiments) as the message. The receiver embeds the message into the same size as of the images’ vector, and then through a neural network combines them together to obtain the guess. Both agents use REINFORCE algorithm [williams1992simple] to train their model, and do not share their policies with each other. There is not any pre-designed meaning associated with the utilized symbols. Their results demonstrate high success rate and show that the learned communications are interpretable. In another work in this direction, in [das2017learning] a fully cooperative two-agent game is considered for the task of image guessing. In particular, two bots, namely questioner bot (Q-BOT) and an answerer bot (ABOT) communicate in natural language and the task for Q-BOT is to guess an unseen image from a set of images. At every round of the game, Q-BOT asks a question, A-BOT provides an answer. Then the Q-BOT updates its information and makes a prediction about the image. The action space is common among both agents consisting of all possible output sequences under a token vocabulary $V$, though the state is local for each agent. For A-BOT the state includes the sequence of questions and answers, the caption provided for the Q-BOT besides the image itself; while the state for Q-BOT does not include the image information. There exists a single reward for both agents in this game. Similar to [simonyan2014very] the REINFORCE algorithm [williams1992simple] is used to train both agents. Note that [jorge2016learning] allows "yes/no" actions within multiple rounds of communication, [lazaridou2016multi] consists of one single round with continues messages, and [das2017learning] combines them such that multiple rounds of continues communication are allowed.

Similarly, [mordatch2018emergence] studies a joint reward problem, in which each agent observes locations and communication messages of all agents. Each agent has a given goal vector $g$ accessed only privately (like moving to or gazing at a given location), and the goal may involve interacting with other agents. Each agent chooses one physical action (e.g., moving or gazing to a new location) and chooses one of the $K$ symbols from a given vocabulary list. The symbols are treated as abstract categorical variables without any predefined meaning and agents learn to use each symbol for a given purpose. All agents have the same action space and share their policies. Unlike [lazaridou2016multi] there is an arbitrary number of agents and they do not have any predefined rules, like speaker and listener, and the goals are not specifically defined such as the correct utterance. The goal of the model is to maximize the reward while creating an interpretable and understandable language for humans. To this end, a soft penalty also is added to encourage small vocabulary sizes, which results in having multiple words to create a meaning. The proposed model uses the state variable of all agents and uses a fully connected model to obtain the embedding ${\mathrm{\Phi}}_{s}$. Similarly, ${\mathrm{\Phi}}_{c}$ is obtained as an embedding of all messages. Then, it combines the goal of the agent $i$, ${\mathrm{\Phi}}_{s}$, and ${\mathrm{\Phi}}_{c}$ through a fully connected model to obtain ${\psi}_{u}$ and ${\psi}_{c}$. Then, the physical action $u$ is equal to ${\psi}_{u}+\u03f5$ and the communication message is $c\sim G({\psi}_{c})$, in which $\u03f5\sim N(0,1)$ and $G(c)=-\mathrm{log}\left(-\mathrm{log}(c)\right)$ is a Gumble-softmax estimator [jang2016categorical]. The results of the algorithm are compared with a no-communication approach in the mentioned game.

[sukhbaatar2016learning] considers a fully cooperative MARL task in which each agent observes a local state and is able to send a continues communication message to the other agents. They propose a model, called CommNet, in which a central controller takes the state observations and the communication messages of all agents, and runs multi-step communications to provide actions of all agents in the output. CommNet assumes that each agent receives the messages of all agents. In the first round, the state observations ${s}_{i}$ of agent $i$ are encoded to ${h}_{i}^{0}$, and the communication messages ${c}_{i}^{0}$ are zero. Then in each round $$, the controller concatenates all ${h}_{i}^{t-1}$ and ${c}_{i}^{t-1}$, passes them into function $f(.)$, which is a linear layer followed by a non-linear function and obtains ${h}_{i}^{t}$ and ${c}_{i}^{t}$ for all agents. To obtain the actions, ${h}_{i}^{K}$ is decoded to provide a distribution over the action space. Furthermore, they provide a version of the algorithm that assumes each agent only observes the messages of its neighbor. Note that, compared to [foerster2016learning], commNet allows multiple rounds of communication between agents, and the number of agents can be different in different episodes. The performance of CommNet is compared with independent learners, the fully connected, and discrete communication over a traffic junction and Combat game from [sukhbaatar2015mazebase].

To extend CommNet, [hoshen2017vain] proposes Vertex Attention Interaction Network (VAIN), which adds an attention vector to learn the importance weight of each message. Then, instead of concatenating the messages together, the weighted sum of them is obtained and used to take the action. VAIN works well when there are sparse agents who interact with each agent. They compare their solution with CommNet over several environments.

In [peng2017multiagent] the authors introduce a bi-directional communication network (BiCNet) using recurrent neural network such that heterogeneous agents can communicate with different sets of parameters. Then a multi-agent vectorized version of AC algorithm is proposed for combat game. In particular, there exists two vectorized networks, namely actor and critic network which are shared among all agents, and each component of the vector represents an agent. The policy network takes the shared observation together with the local information and returns the actions for all agents in the network. The Bi-directional recurrent network is designed in a way to be served as a local memory too. Therefore, each individual agent is capable of maintaining its own internal states besides sharing the information with its neighbors. In each iteration of the algorithm, the gradient of both networks are calculated and the weights of the networks get updated accordingly using Adam algorithm. In order to reduce the variance, they applied the deterministic off-policy AC algorithm [silver2014deterministic]. The proposed algorithm applied to the multi-agent StarCraft combat game [samvelyan19smac]. It is shown that BiCNet is able to discover several effective ways to collaborate during the game.

[singh2018learning] considers the MARL problem in which each agent has a local reward and observation. An algorithm called Individualized Controlled Continuous Communication Model (IC3Net) is proposed to learn to *what and when to communicate*, which can be applied to cooperative, competitive, and semi-cooperative environments ^{3}^{3}
3
Semi-cooperative environments are those that each agent looks for its own goal while all agents also want to maximize a common goal..
IC3Net allows multiple continues communication cycles and in each round uses a gating mechanism to decide to communicate or not. The local observation ${o}_{i}^{t}$ are encoded and passed to an LSTM model, which its weights are shared among the agents. Then, the final hidden state ${h}_{i}^{t}$ of the LSTM for agent $i$ in time step $t$ is used to get the final policy.
A Softmax function $f(.)$ over ${h}_{i}^{t}$ returns a binary action to decide whether to communicate or not. Considering message ${c}_{i}^{t}$ of agent $i$ at time $t$, the action ${a}_{i}^{t}$ and ${c}_{i}^{t+1}$ are:

${g}_{i}^{t+1}=$ | $f({h}_{i}^{t}),$ | (22) | ||

${h}_{i}^{t+1},{l}_{i}^{t+1}=$ | $\text{LSTM}(e({o}_{i}^{t})+{c}_{i}^{t},{h}_{i}^{t},{l}_{i}^{t}),$ | (23) | ||

${c}_{i}^{t+1}=$ | $\frac{1}{N-1}}C{\displaystyle \sum _{i\ne j}}{h}_{j}^{t+1}{g}_{j}^{t+1},$ | (24) | ||

${a}_{i}^{t}=$ | $\pi ({h}_{i}^{t}),$ | (25) |

in which ${l}_{i}^{t}$ is the cell state of the LSTM cell, $C$ is a linear transformator, and $e(.)$ is an embedding function. Policy $\pi $ and gating function $f$ are trained using REINFORCE algorithm [williams1992simple]. In order to analyze the performance of IC3Net, predator-pray, traffic junction [sukhbaatar2016learning], and StarCraft with explore and combat tasks [samvelyan19smac] are considered. The results are compared with CommNet [sukhbaatar2016learning], no-communication model, and no-communication model with only global reward.

In [jaques2018intrinsic], the authors aim to avoid centralized learning in multi-agent RL problems, when each agent observes local state ${o}_{i}^{t}$, takes a local action, and observes local reward ${z}_{i}^{t}$ from the environment. The key idea is to define a reward called intrinsic reward for influencing the other agents’ actions. In particular, each agent simulates the potential actions that it can take and measures the effect on the behavior of other agents by having the selected action. Then, the actions which have a higher effect on the action of other agents will be rewarded more. Following this idea, the reward function ${r}_{i}^{t}=\alpha {z}_{i}^{t}+\beta {c}_{i}^{t}$ is used, where ${c}_{i}^{t}$ is the casual influence reward on the other agents, $\alpha $, and $\beta $ are some trade-off weights. ${c}_{i}^{t}$ is computed by measuring the KL difference in the policy of agent $j$ when ${a}_{i}$ is known or is unknown as below:

$${c}_{i}^{t}=\sum _{j\ne i}\left[{D}_{KL}[p({a}_{j}^{t}|{a}_{i}^{t},{o}_{i}^{t})||p({a}_{j}^{t}|{o}_{i}^{t})]\right]$$ | (26) |

In order to measure the influence reward, two different scenarios are considered: (i) a centralized training, so each agent observes the probability of another agent’s action for a given counterfactual, (ii) modeling the other agents’ behavior. The first case can be easily handled by the equation (26). In the second case, each agent is learning $p({a}_{j}^{t}|{a}_{i}^{t},{o}_{i}^{t})$ through a separate neural network. In order to train these neural networks, the agents use the history of observed actions and the cross-entropy loss functions. The proposed algorithm is analyzed on harvest and clean-up environments and is compared with an A3C baseline and baseline which allows agents to communicate with each other. This work is partly relevant to *Theory of Mind* which tries to explain the effect of agents on each other in multi-agent settings. To see more details see [rabinowitz2018machine].

[das2018tarmac] proposes an algorithm, called TarMAC, to learn to communicate in a multi-agent setting, where the agents *learn to what to sent* and also learn to *communicate to which agent*.
They show that the learned policy is interpretable, and can be extended to competitive and mixed environments. To make sure that the message gets enough attention by the intended agents, each agent also encodes some information in the continuous message to define the type of the agent that the message is intended for. This way, the receiving agent can measure the relevance of the message to itself.
The proposed algorithm follows a centralized training with decentralized execution paradigm. Each agent accesses local observation, observes the messages of all agents, and the goal is maximizing the team reward $R$, while the discrete actions are executed jointly in the environment.
Each agent sends a message including two parts, the signature $({k}_{i}^{t}\in {R}^{{d}_{k}})$ and the value $({v}_{i}^{t}\in {R}^{{d}_{v}})$. The signature part provides the information of the intended agent to receive the message, and the value is the message. Each recipient $j$ receives all messages and learns variable ${q}_{j}^{t}\in {R}^{{d}_{k}}$ to receive the messages. Multiplying ${q}_{j}^{t}$ and ${k}_{i}^{t}$ for all $i\in \{1,\mathrm{\dots},N\}$ results in the attention weights of ${\alpha}_{ij}$ for all messages from agents $i\in \{1,\mathrm{\dots},N\}$. Finally, the aggregated message ${c}_{i}^{t}$ is the weighted sum of the message values, which the weights are the obtained attention values. This aggregated message and the local observation ${o}_{i}^{t}$ are the input of the local actor. Then, a regular actor-critic model is trained which uses a centralized critic. The actor is a single layer GRU layer, and critic uses the joint actions $\{{a}_{1},\mathrm{\dots},{a}_{N}\}$ and the hidden state $\{{h}_{1},\mathrm{\dots},{h}_{N}\}$ to obtain the Q-value. Also, the actors share the policy parameters to speed-up the training, and multi-round communication is used to increase efficiency. The proposed method (with no attention, no communication version of the algorithm) is compared with SHAPES [andreas2016neural], traffic junction in which they control the cars, and House3D [wu2018building] as well as the CommNet [sukhbaatar2016learning] when it was possible.

All the papers discussed so far in this section assumed the existence of a communication message and basically, they allow each agent to learn what to send. In a different approach, [jiang2018learning] fixes the message type and only allows each agent to decide to start a communication with the agents in its receptive field. They consider the problem that each agent observes the local observation, takes a local action, and receives a local reward. The key idea here is that when there is a large number of agents, sharing the information of all agents and communication might not be helpful since it is hard for the agent to differentiate the valuable information from the shared information. In this case, the communication might impair the learning. To address this issue, an algorithm, called ATOC is proposed in which an attention unit learns when to integrate the shared information from the other agents. In ATOC, each agent encodes the local observation, i.e. ${h}_{i}^{t}={\mu}_{I}({o}_{i}^{t};{\theta}_{mu})$ in which ${\theta}_{mu}$ is the weights of a MLP. Every $T$ time step, agent $i$ runs an attention unit with input ${h}_{i}^{t}$ to determine whether to communicate with the agents in its receptive field or not. If it decides to communicate, a communication group with at most $m$ collaborator is created and does not change for $T$ time-steps. Each agent in this group sends the encoded information ${h}_{i}^{t}$ to the communication channel, in which they are combined and ${\stackrel{~}{h}}_{i}^{t}$ is returned for each agent $i\in {\mathcal{M}}_{\U0001d4be}$, where ${\mathcal{M}}_{\U0001d4be}$ is the list of the selected agents for the communication channel. Then, agent $i$ merges ${\stackrel{~}{h}}_{i}^{t}$ with ${h}_{i}^{t}$, passes it to the MLP and obtains ${a}_{i}^{t}={\mu}_{II}({h}_{i}^{t},{\stackrel{~}{h}}_{i}^{t};{\theta}_{\mu})$. Note that one agent can be added in two communication channel, and as a result, the information can be transferred among a larger number of agents. The actor and critic models are trained following the same way as the DDPG model, and the gradients of the actor (${\mu}_{II}$) are also passed in the communication channel, if relevant. Also, the difference of the Q-value with and without communication is obtained and is used to train the attention unit. Some numerical experiments on the particle environment are done and ATOC is compared with CommNet, BiCNet, and DDPG (ATOC without any communication). Their experiments involve at least 50 agents, and MADDPG was not a benchmark.

## 7 Emerging Areas

In this section, we discuss a few recent papers which either combine the approaches in sections 3-6 or they propose a model that does not quite fit in either of the previous sections.

[foerster2018multi] considers a problem in which each agent observes a local observation, selects an action which is not known to other agents, and receives a joint reward, known to all agents. Further, it is assumed that all agents access a common knowledge, and all know that any agent knows this information, and each agent knows that all agents know that all agents know it and so on. Also, there might be subgroups of the agents who share more common knowledge, and the agents inside each group use a centralized policy to take action and each agent plays its own action in a decentralized model. Typically, subgroups of the agents have more common knowledge and the selected action would result in higher performance than the actions selected by larger groups. So, having groups of smaller size would be interesting. However, there is a computational trade-off between the selecting smaller or larger subgroups since there is numerous possible combination of agents to form smaller groups. This paper proposes an algorithm to address this challenge, i.e., divide the agents to a new subgroup or take actions via a larger joint policy. The proposed algorithm, called MACKRL, provides a hierarchical RL that in each level of hierarchy decides either to choose a joint action for the subgroup or propose a partition of the agents into smaller subgroups. This algorithm is very expensive to run since the number of possible jointed-agents increases exponentially and the algorithm becomes intractable. To address this issue, a pairwise version of the algorithm is proposed in which there are three levels of hierarchies, the first for grouping agents, the second one for either action-selection or sub-grouping, and the last one for action selection. Also, a Central-V algorithm is presented for training actor and critic networks.

In [shu2018m] a different setting of the MARL is considered. In this problem, a manager along with a set of self-interested agents (workers) with different skills and preferences work on a set of tasks. In this setting, the agents like to work on their preferred tasks (which may not be profitable for the entire project) unless they offered the right bonus for doing a different task. Furthermore, the manager does not know the skills and preferences (or any distribution of them) of each individual agent in advance. The problem goal is to train the manager to control the workers by inferring their minds and assigning incentives to them upon the completion of particular goals. The approach includes three main modules. (i) Identification, which uses workers’ performance history to recognize the identity of agents. In particular, the performance history of agent $i$ is denoted by ${\mathbb{P}}_{i}=\{{P}_{i}^{t}=({\rho}_{igb}^{t}):t=1,2,\mathrm{\cdots},T\}$, where ${\rho}_{igb}^{t}$ is the probability of worker $i$ finishes the goal $g$ in $t$ steps with $b$ bonuses. In this module, these matrices are flattened into a vector and encoded to history representation denoted by ${h}_{i}$. (ii) Modeling the behavior of agents. A worker’s mind is modeled by its performance, intentions, and skills. In mind tracker module, the manager encodes both current and past information to updates its beliefs about the workers. Formally, let’s ${\mathrm{\Gamma}}_{i}^{t}=({s}_{i}^{\tau},{a}_{i}^{\tau},{g}_{i}^{\tau},{b}_{i}^{\tau}):\tau =1,2,\mathrm{\cdots},t$ denotes the trajectory of worker $i$. Then mind tracker module $M$ receives ${\mathrm{\Gamma}}_{i}^{t}$ as well as history representation ${h}_{i}$ from the first module and outputs ${m}_{i}$ as the mind for agent $i$. (iii) Training the manager, which includes assigning the goals and bonuses to the workers. To this end, the manager needs to have all workers as a context defined as ${c}^{t+1}=C(\{({s}_{i}^{t+1},{m}_{i}^{t},{h}_{i}):i=1,2,\mathrm{\cdots},N\})$, where $C$ pools all workers information. Then utilizing both individual information and the context, the manager module provides the goal policy ${\pi}^{g}$ and bonus policy ${\pi}^{b}$ for all workers. All three modules are trained using the A2C algorithm. The proposed algorithm is evaluated in two environments: Resource Collection and Crafting in 2D Minecraft. The results demonstrate that the manager can estimate the workers’ mind through monitoring their behavior and motivate them to accomplish the tasks they do not prefer.

Next, we discuss MARL in hierarchical setting. To do so, let us briefly introduce the hierarchical RL. In this setting, the problem is decomposed into a hierarchy of tasks such that easy-to-learn tasks are at the lower level of the hierarchy and a strategy to select those tasks can be learned at a higher level of the hierarchy. Thus, in the hierarchical setting, the decisions at the high level are made less frequently than those at the lower level, which usually happens at every step. The high level policy is mainly focused on long-run planning, which involves several one-step tasks in the low-level of the hierarchy. Following this approach, in single agent hierarchical RL (e.g. [kulkarni2016hierarchical, vezhnevets2017feudal]), a meta-controller at the high-level learns a policy to select the sequence of tasks and a separate policy is trained to perform each task at the low-level.

For the hierarchical MARL, two possible scenarios are synchronous and asynchronous. In the synchronous hierarchical MARL, all high-level agents take actions at the same time. In other words, if one agent takes its low-level actions earlier than other agents, it has to wait until all agents finish their low-level actions. This could be a restricted assumption if the number of agents is quite large. On the other hand, there is no restriction on asynchronous hierarchical MARL. Nonetheless, obtaining high-level cooperation in asynchronous case is challenging. In the following we study some recent papers in hierarchical MARL.

In [tang2018hierarchical] a cooperative problem with sparse and delayed rewards is considered, in which each agent accesses a local observation, takes a local action, and submit the joint action into the environment to get the local rewards. Each agent has some low-level and high-level actions to take such that the problem of the task selection for each agent can be modeled as a hierarchical RL problem. To solve this problem, three algorithms are proposed: Independent hDQN (Ind-hDQN), hierarchical Communication networks (hCom), and hierarchical hQmix. Ind-hDQN is based on the hierarchical DQN (hDQN) [kulkarni2016hierarchical] and decomposes the cooperative problem into independent goals and then learns them in a hierarchical manner. In order to analyze Ind-hDQN, first, we describe hDQN—for the single-agent—and then explain Ind-hDQN for multi-agent setting. In hDQN, the meta-controller is modeled as a semi-MDP (SMDP) and the aim is to maximize

$${\stackrel{~}{r}}_{t}=R({s}_{t+\tau}|{s}_{t},{g}_{t})={r}_{t}+\mathrm{\cdots}+{r}_{t+\tau},$$ |

where, ${g}_{t}$ is the selected goal by the meta-controller and $\tau $ is the stochastic number of periods to achieve the goal. Via ${\stackrel{~}{r}}_{t}$, a DQN algorithm learns the meta-controller policy. This policy decides which low-level task should be taken at each time step. Then, the low-level policy learns to maximize the goal-dependent reward ${\widehat{r}}_{t}$. In Ind-hDQN it is assumed that agent $i$ knows local observation ${o}_{i}^{t}$, its meta-controller learns policy ${\pi}_{i}({g}_{i}^{t}|{o}_{i}^{t})$, and in the low-level it learns policy ${\widehat{\pi}}_{i}({a}_{i}^{t}|{g}_{i}^{t})$ to interact with the environment. The low-level policy is trained by the environment’s reward signals ${r}_{i}^{t}$ and the meta-controller’s policy is trained by the intrinsic reward ${\widehat{r}}_{i}^{t}$. Since Ind-hDQN trains independent agents, it can be applied to both synchronous and asynchronous settings.

In the second algorithm, named hCom, the idea of CommNet [sukhbaatar2016learning] is combined with Ind-hDQN. In this way, Ind-hDQN’s neural network is modified to include the average of the ${h}^{\text{th}}$ hidden layers of other agents, i.e., it is added as the ${(h+1)}^{\text{th}}$ layer of each agent. Similar to Ind-hDQN, hCom works for both synchronous and asynchronous settings. The third algorithm, hQmix, is based on Qmix [rashid2018qmix] to handle the case that all agents share a joint reward ${r}_{t}$. To this end, the Qmix architecture is added to the meta-controller and as a result, the Qmix allows training separated Q-values for each agent. This is possible by learning ${Q}_{tot}$ as is directed in the Qmix. hQmix only is applicable for synchronous settings, since ${Q}_{tot}$ is estimated over joint-action of all agents. In each of the proposed algorithms, the neural network’s weights of the policy are shared among the tasks that have the same input and output dimensions. Moreover, the weights of the neural network are shared among the agents for the low-level policies. Thus, only one low-level network is trained; although, it can be used for different tasks and by all agents. In addition, a new replay buffer, Augmented Concurrent Experience Replay (ACER), is proposed. ACER saves transition tuple $({o}_{i}^{t},{g}_{i}^{t},{\stackrel{~}{r}}_{i}^{t},\tau ,{o}_{i}^{t+\tau})$ for meta-controller and saves $A{E}_{i}(t,\tau )={\{({o}_{i}^{t+k},{g}_{i}^{t},{\stackrel{~}{r}}_{i}^{t+k},\tau -k,{o}_{i}^{t+\tau})\}}_{k=0}^{\tau -1}$ to train the low-level policy. ACER also uses the idea of Concurrent Experience Replay Trajectories (CERTs) [omidshafiei2017deep] such that experiences are stored in the rows of episodes and columns of time steps to ensure availability of concurrent mini-batches. They have analyzed their algorithm in Multiagent Trash Collection tasks (an extension of environment [makar2001hierarchical]) and Fever Basketball Defense. In the experiments, the low-level learning is done for several homogeneous agents so that it can be considered as a single-agnet learning problem. The results are compared with Ind-DQN and Ind-DDQN with prioritized experience replay [schaul2015prioritized].

In the literature of MARL, there are several papers who consider the games with Nash equilibrium. In a MARL setting Nash equilibrium is achieved, if all agents get their own highest possible value-function and not willing to change it. Here we only discuss a few recent papers since [lanctot2017unified] provides a detailed review of the older papers.

[zhang2018scc] discusses the coordination problem in the multi-agent cooperative domains with a fully observable state and continues actions to find the Pareto-optimal Nash-equilibrium. The paper proposes an algorithm named Sample Continuous Coordination with recursive Frequency Maximum Q-Value (SCC-rFMQ) which includes two main parts: (i) given state $s$, a set of discrete actions from the continues set ${A}_{i}(s)$ are selected for agent $i$, (ii) the action evaluation and training the policy are performed. The first phase involves selecting a set of good actions which are seen before while performs exploration. In this way, it follows a Coordination Re-sample (CR) strategy that preserves $n/3$ best previous actions for each agent. The rest of the actions are selected according to the variable probability distribution, i.e. get actions randomly via $N({a}_{max}(s),\sigma )$, in which ${a}_{max}(s)$ is the action that gives maximum Q-value for state $s$. Let use ${a}^{\ast}(s)$ to denote the action of the best seen Q-value. If ${a}_{max}(s)\ne {a}^{\ast}(s)$, exploration rate ${\sigma}_{i}(s)$ is reset to the initial value of 1/3; otherwise if $$, CR shrinks ${\sigma}_{i}(s)$ with a given rate; and expands it otherwise. Using the new ${\sigma}_{i}(s)$, new actions are selected with $N({a}_{max},\sigma )$ and resets ${A}_{i}(s)$ and $Q(s,a)$.

Beside CR, [zhang2018scc] also utilizes rFMQ [matignon2012independent], which extends Q-Learning with the frequency value $F(s,a)$. In this way, in addition to $Q(s,a)$, ${Q}_{max}(s,a)$ is also obtained and updated through the learning procedure. To select actions, an evaluation-value function $E(s,a)$ is obtained to run greedily such that $E(s,a)=(1-F(s,a))Q(s,a)+F(s,a){Q}_{max}(s,a)$ and $F(s,a)$ estimates the percentage of time that action $a$ results in observing the maximum reward, for state $s$. The estimation of the frequency $F(s,a)$ is recursively updated via a separate learning rate. In order to show the effectiveness of SCC-rFMQ, the results of a climbing game, a two-player game with Nash equilibrium is presented, as well as the boat navigation problem with two actions. SCC-rFMQ is compared with MADDPG, Sequential Monte Carlo Methods (SMC) [lazaric2008reinforcement], and rFMQ [matignon2012independent]. SMC [lazaric2008reinforcement] itself is an actor-critic algorithm with continues action space. In this algorithm, the actor takes action randomly such that the probability of extraction of each action is equal to the importance weight of the action. Also, the critic approximates an action-value function based on the observed reward of the playing action. Then, based on the function, actor updates the policy distribution. In this way, for each state, the actor provides a probability distribution over the continues action space and based on the importance of sampling, the action is selected.

## 8 Applications

The MARL problem has numerous application in the real world. In this section, we review some of the application-oriented papers, in which the problem is modeled as a MARL problem. The main focus will be on describing the problem and what kind of approach is used to solve the problem so that there are not much of technical details about the problem and algorithm.

A multi-agent Q-learning algorithm has been proposed in [wang2016multi] for dynamic web service composition problem. In this problem, there exists a sequence of tasks that need to be done in order to accomplish the web service composition problem. The key idea in this work is to decompose the main task into independent sub-tasks. Then a tabular-based Q-learning algorithm is applied to find the optimal strategy for the sub-task. In addition, to improve the efficiency of the proposed method they proposed the experience sharing strategy. In this case, there exists a supervisor agent, who is responsible to communicate with all other agents to spread the existing knowledge in a particular agent among all other ones.

[prabuchandran2014multi] proposes a tabular Q-learning algorithm to control multiple traffic signals on neighbor junctions to maximize the traffic flow. Each agent (the traffic light) accesses the local observations, including the number of lanes and the queue length at each lane, decides about the green light duration, and shares the queue length with its neighbors. The cost of each agent is the average of the queue length at its neighbors so that it tries to minimize the queue length of its own queue as well as all its neighbors. The algorithm is compared with two classical approaches in a simulated environment of two areas with 9 and 12 junctions in India. Similar problem with the RL approach is studied in [abdoos2011traffic]. Also, recently [zhang2019cityflow] provided CityFlow, a new traffic-signal environment to be used for MARL researches. In a related area, [lin2018efficient] considers rid-sharing management problem and propose a customized Q-learning and an actor-critic algorithm for this problem. The algorithms are evaluated on a built simulation with Didi Chuxinghe, which is largest rid-sharing company in China. In a similar direction, [brittain2019autonomous] studies the air-traffic control problem as a MARL problem to ensure safe separation between aircraft. Each airplane, as learning agent, shares the state information with $N$ closest agents. Each state includes distance to the goal, speed, acceleration, distance to the intersection, and the distance to the $N$ closest airplanes. Each agent decides about the change of speed from three possible choices and receives a penalty if it is in a close distance of another airplane. The goal of the model is to identify and address the conflicts between air-crafts in high-density intersections. The A2C algorithm is used, while a centralized training decentralized execution approach is followed. The actor and critic network share the weights. BlueSky air traffic control simulator is used as the environment and the results of a case with 30 airplanes are provided.

In [xiao2018distributed] a distributed tabular Q-learning algorithm was proposed for bike rebalancing problem. Using distributed RL, they improve the current solutions in terms of frequency of trips and the number of bikes are being moved on each trip. In the problem setup, the action is the number of bikes to move. The agents receive a positive reward if the bike stock is within a particular rang, and negative reward otherwise. Also, there exists a negative reward for every bike moves in each hour. Each agent acts independently; however, there exists a controller called a knowledge repository (KR) that shares the learning information among the agents. More specifically, the KR is designed to facilitate the transfer learning [lazaric2012transfer] among the agents. Using only distributed RL the success ratio ( the number of successfully rebalanced stations over the total stations) is improved about $10\%$ to $35\%$. Furthermore, combining with the transfer learning, the algorithm rebalances the network $62.4\%$ better than the one without transfer learning.

[zhang2009multi] considers an online distributed resource allocation problem, such that each agent is a server and observes the information of its neighbors. In each time step, each agent receives a task and has to decide to allocate it locally, or has to pick a neighbor and send it to that neighbor. Once the task is finished, the agent is rewarded by some utility of the task. Due to the communication bandwidth constraints, the number of tasks that each agent can send to its neighbors is limited and the goal is to cooperatively maximize the utility of the whole cluster. An algorithm based on tabular Q-learning is proposed and its results are compared with a centralized controller’s result. [wu2011novel] also considers a similar problem and propose another value-based algorithm. Moreover, in [wu2018decentralised] a similar problem is studied in which there are $n$ schedulers who dispatches jobs of $k$ customers in $m$ machines. They also propose a value-based algorithm and use a gossip mechanism to transfer utilities among the neighbor agents. In a similar domain, [ye2015multi] consider a packet routing problem in the wireless sensor networks and model it as a MARL problem. In this problem, the sensor data should be sent to a base station to analyze them, though usually each sensory unit has a limited capacity to store data, and there are strict communication bandwidth limits. This problem has several applications in surveillance, video recording, processing and communicating. When a packet is sent from a given sensory unit, the distance to the destination and the size of the packet determines the required energy to send the packet, and one of the goals of the system is to minimize the sum of consumed energy. They propose an algorithm based on the Q-learning in which each agent selects some cooperating neighbors in a given radius and then can communicate with them. The results are compared with several classical algorithms. Within the similar domain, [dandanov2017dynamic] propose a RL algorithm to deal with the antenna tilt angle in mobile antenna, to get a compromise between the mobile coverage and the capacity of the network. The reward matrix of the problem is built and transition probability among the states are known so that optimal value for each state-action is achieved.

In [mousavi2019multiagent] the authors show that a decentralized multi-agent reinforcement learning can be used for image classification. In their framework, multiple agents receive partial observations of the environment, communicate with the neighbors on a communication graph, and relocate to update their locally available information. Using an extension of the policy gradient algorithm, an algorithm is proposed to update the prediction and motion planning modules in an end-to-end manner. The results on MNIST data-set is provided in which each agent only observes a few pixels of the image.

[bao2019multi] consider the liquidation of a large amount of stock in a limited time $T$. The liquidation process usually is done with cooperation several traders/brokers and massively impacts the market. The problem becomes more complex when other entities want to liquidate the same stock in the market. This problem can be modeled as a MARL with (i) competitive objectives: when each agent want to sell its own stock with the highest price, (ii) cooperative objective: when several agents want to cooperatively sell the stock of one customer at the highest price. The problem is modeled as MARL in which each agent has to select to sell $a\in [0,1]$ percent of stocks in each time step. If the agent selects to sell nothing, it takes the risk of dropped prices and at the end of $T$ time periods, the trader has to sell all remaining stocks even with zero price. An adapted version of the DDPG algorithm is proposed to solve this problem.

## 9 Environments

Environments are core elements in any MDP problem. Basically, the environment provides the scope of the problem and different problems/environments have been the motivation for developing the new RL algorithms. Interacting with the real world is usually expensive, time-consuming, or sometimes impossible. Thus, using the simulator of environments is the common practice. Using simulators helps to compare different algorithms and indeed provides a framework to compare different algorithms. With these motivations, several single agent environments are developed. Among them, Arcade which provides Atari-2600 [bellemare2013arcade], MoJoCo (simulates the detailed physics movements of human and some animals body) [todorov2012MuJoCoAP], OpenAI Gym which gathers these together [brockman2016openai], PyGame Learning Environment (similar to Arcade) [tasfi2016PLE], OpenSim (builds musculoskeletal structures of human) [seth2011opensim], DeepMind Lab (3D navigation and puzzle-solving) [beattie2016deepmind], ViZDoom (3D Shooting and Navigation Doom using only the visual information) [Kempka2016ViZDoom], Malmo (based on Minecraft) [johnson2016malmo], MINOS (Home indoor 3D Navigation) [savva2017minos], House3D (3D Navigation in indoor area) [wu2018building], and MazeLab [mazelab] just are few to mention. Either of these environments at least include standard step and reset functions, such that env.reset() returns a random initial state and env.step(${\mathrm{a}}_{\mathrm{t}}$) returns ${s}_{t+1},{r}_{t},{d}_{t},$ dict in which dict is some additional information. This is a general structure which makes it possible to reproduce a given trajectories with a given policy.

In the multi-agent domain, there is a smaller number of available environments. In addition, there is a broad range of possible settings for sharing information among different agents. For example, some environments involve communication actions, some share the joint reward, some share the global state and each of these cases need special algorithms and not all of the algorithms in Sections 3-7 can be applied to each of these problems. Considering these limitations, there is a smaller number of environments in each setting. Among those, StarCraft II [samvelyan19smac] has achieved a lot of attention in the recent years [foerster2017stabilising, usunier2016episodic, singh2018learning, foerster2018counterfactual, rashid2018qmix, peng2017multiagent]. In this game, each agent only observes its own local information and receives a global reward, though different versions of the game with the globally observable state are also available. Finding dynamic goals also has been a common benchmark for both discrete and continues action spaces. Multi-agent particle environment [mordatch2017emergence] gathers a list of navigation tasks, e.g., the predator-prey for both discrete and continues actions. Some of the games allow choosing a communication action too. Harvest-gathering [jaques2018intrinsic] is a similar game with communication-action. Neural MMO [suarez2019neural] provides MMORPGs (Massively Multiplayer Online Role-Playing Games) environment like Pokemon in which agents learn combat and navigation while there is a large population of the same agents with the same goal. In the area of traffic management, [zhang2019cityflow] provided CityFlow, a new traffic-signal environment to be used for MARL researches. In addition, [wu2017flow] introduce a framework to control cars within a mixed system of human-like driver and AI agents.

In addition to the mentioned environments which are proposed by different papers, few projects gathered some of the environments together to provide a similar framework like OpenAI Gym for the multi-agent system. [jiang2019MarlEnv] provides 12 environments navigation/maze-like games. Unity [juliani2018unity] is a platform to develop single/multi-agent games which can be simple grid-world or quite complex strategic games with multiple agents. The resulted game can be used as an environment for training machine learning agents. The framework supports cooperative and competitive multi-agent environments. Unity gives the ability to create any kind of multi-agent environment that is intended; although it is not specifically designed for multi-agent systems. Arena [song2019arena] extends the Unity engine and provides an specific platform to define and build new multi-agent games and scenarios based on the available games. The framework includes 38 multi-agent games from which 27 are new games. New scenarios or games can be built on top of these available games. Designing the games involve a GUI-based configurable social tree, and reward function can be selected from five reward scheme that are proposed to cover most of the possible cases in competitive/cooperative games. Similarly, Ray platform [moritz2018ray] also recently started to support multi-agent environments and some of the known algorithms are also added in rllib repository [liang2017ray, liang2017rllib]. Ray supports entering/leaving the agents from the problem which is a common case in traffic control suites.

## 10 Conclusion and Future Research Directions

In this review, we categorize MARL problems into five groups, namely independent-learners, fully observable critic, value function decomposition, consensus, and learn to communicate. Then we provide an overview of the most recent papers in these classes. For each paper, first, we have highlighted the problem setup, such as the availability of global state, global action, reward, and the communication pattern among the agents. Then, we presented the key idea and the main steps of the proposed algorithm. Finally, we listed the environments which have been used for evaluating the performance of the algorithm. In addition, among the broad range of applications of MARL for real-world problems, we picked a few representative ones and showed how MARL can be utilized for solving such complicated problems.

In Table 1 a summary the of most influential papers in each category is presented. In this summary we have gathered the the general settings of the considered problem and the proposed algorithm to show the gaps and the possible research directions. For example, the table shows that with value decomposition (VD), there is not any research that considers the local states, the local actions, and the local policies. In this table, the third column, com, shows the the communication status, i.e., $0$ means there is no communication among the agents, and $1$ otherwise. In forth column, AC means the proposed algorithm is actor-critic based, and $Q$ is used when the proposed algorithm is value-based. In fifth column, conv stands for the convergence. Here, $1$ is for the case that the convergence analysis is provided and otherwise it is $0$. In the last three columns tuple (Trn, Exe) stands for (Training and Execution) and G and L are for the global and local availability respectively, and determine whether state, action, and policy of each agent is known to other agents.

State | Action | Reward | |||||

Reference | Com | AC/Q | Conv | (Trn, Exe) | (Trn, Exe) | (Trn, Exe) | |

IQL |
tan1993multi, tan1993multi [tan1993multi] | 0 | Q | 0 | (L,L) | (L,L) | (G,G) |

lauer2000algorithm, lauer2000algorithm [lauer2000algorithm] | 0 | Q | 0 | (G,G) | (L,L) | (G,G) | |

matignon2007hysteretic, matignon2007hysteretic [matignon2007hysteretic] | 0 | Q | 0 | (G,G) | (G,L) | (G,G) | |

tampuu2017multiagent, tampuu2017multiagent [tampuu2017multiagent] | 0 | Q | 0 | (G,G) | (L,L) | (G,G) | |

omidshafiei2017deep, omidshafiei2017deep [omidshafiei2017deep] | 0 | Q | 0 | (G,G) | (G,L) | (G,G) | |

fuji2018deep, fuji2018deep [fuji2018deep] | 0 | Q | 0 | (G,G) | (G,L) | (G,G) | |

Fully Observable Critic |
wang2019r, wang2019r [wang2019r] | 1 | AC | 0 | (G,L) | (G,L) | (L,L) |

foerster2018counterfactual, foerster2018counterfactual [foerster2018counterfactual] | 0 | AC | 0 | (G,L) | (G,L) | (G,G) | |

ryu2018multi, ryu2018multi [ryu2018multi] | 0 | AC | 0 | (G,L) | (G,L) | (L/G,L/G) | |

sartoretti2019distributed, sartoretti2019distributed [sartoretti2019distributed] | 0 | AC | 0 | (G,G) | (L,L) | (L,L) | |

chu2017parameter, chu2017parameter [chu2017parameter] | 0 | AC | 0 | (L,L) | (L,L) | (L/G,L/G) | |

yang2018cm3, yang2018cm3 [yang2018cm3] | 0 | AC | 0 | (L,L) | (L,L) | (L,L) | |

jiang2018graph, jiang2018graph [jiang2018graph] | 1 | Q | 0 | (L,L) | (L,L) | (L,L) | |

iqbal2018actor, iqbal2018actor [iqbal2018actor] | 1 | AC | 0 | (G,L) | (G,L) | (G,L) | |

yang2018mean, yang2018mean [yang2018mean] | 1 | AC/Q | 1 | (G,G) | (G,L) | (G,L) | |

kim2019learning, kim2019learning [kim2019learning] | 1 | AC | 0 | (G,L) | (G,L) | (G,G) | |

lowe2017multi, lowe2017multi [lowe2017multi] | 0 | AC | 0 | (G,L) | (G,L) | (L,L) | |

jiang2018graph, jiang2018graph [jiang2018graph] | 0 | Q | 0 | (G,L) | (G,L) | (L,L) | |

VD |
rashid2018qmix, rashid2018qmix [rashid2018qmix] | 0 | Q | 0 | (G,L) | (G,L) | (G,G) |

sunehag2018value, sunehag2018value [sunehag2018value] | 0 | Q | 0 | (L,L) | (L,L) | (G,G) | |

mguni2018inducing, mguni2018inducing [mguni2018inducing] | 0 | Q | 1 | (L,L) | (L,L) | (G,G) | |

son2019qtran, son2019qtran [son2019qtran] | 0 | Q | 1 | (L,L) | (L,L) | (G,G) | |

Consensus |
zhang2018fully, zhang2018fully [zhang2018fully] | 1 | AC | 1 | (G,G) | (G,L) | (L,L) |

kar2013cal, kar2013cal [kar2013cal] | 1 | Q | 1 | (G,G) | (L,L) | (L,L) | |

PrimalDual_Lee_2018, PrimalDual_Lee_2018 [PrimalDual_Lee_2018] | 1 | Q | 1 | (G,G) | (L,L) | (L,L) | |

distributed_macua_2015, distributed_macua_2015 [distributed_macua_2015] | 1 | Q | 0 | (L,L) | (L,L) | (G,G) | |

macua2017diff, macua2017diff [macua2017diff] | 1 | AC | 0 | (L,L) | (L,L) | (L,L) | |

cassano2018multi, cassano2018multi [cassano2018multi] | 1 | Q | 1 | (L/G,L/G) | (L,L) | (L/G,L/G) | |

zhang2018networked, zhang2018networked [zhang2018networked] | 1 | AC | 1 | (G,G) | (G,L) | (L,L) | |

zhang_distributed_2019, zhang_distributed_2019 [zhang_distributed_2019] | 1 | AC | 1 | (G,G) | (G,G) | (L,L) | |

Learn to comm |
varshavskaya2009efficient, varshavskaya2009efficient [varshavskaya2009efficient] | 1 | AC | 1 | (L,L) | (L,L) | (L,L) |

peng2017multiagent, peng2017multiagent [peng2017multiagent] | 1 | AC | 0 | (G,G) | (G,G) | (L,L) | |

foerster2016learning, foerster2016learning [foerster2016learning] | 1 | Q | 0 | (L,L) | (L,L) | (G,G) | |

sukhbaatar2016learning, sukhbaatar2016learning [sukhbaatar2016learning] | 1 | AC | 0 | (L,L) | (L,L) | (G,G) | |

singh2018learning, singh2018learning [singh2018learning] | 1 | AC | 0 | (L,L) | (L,L) | (L,L) | |

lazaridou2016multi, lazaridou2016multi [lazaridou2016multi] | 1 | AC | 0 | (G,G) | (L,L) | (G,G) | |

das2017learning, das2017learning [das2017learning] | 1 | AC | 0 | (L,L) | (G,G) | (L,L) |

Beside this table, in what follows, we list a few potential research directions and open problems in the realm of MARL.

Off-policy MARL: In RL, off-policy refers to learning about a policy (called target policy), while the learning agent is behaving under a different policy (called behavior policy ). This method is of great interest because it can learn about an optimal policy while it explores and also learns about multiple policies only by following a single policy. While the former advantage is helpful in both single and multi-agent setting, the latter one seems to be more fit to the multi-agent setting. In MARL literature, there exist a few algorithms utilizing the off-policy [zhang2018networked, suttle2019multi, distributed_macua_2015]; however, there is still room for research on off-policy in MARL in algorithm design, theory, and applications.

Safe MARL: Safe RL is defined as the training of agent to learn a policy which maximizes the long-term reward while ensures reasonable performance in order to respect safety constraints and avoid catastrophic situations during the training as well as execution of the policy. The main approaches in safe RL are based on introducing the risk concept in optimality conditions and regulating the exploration process to avoid undesirable actions. There is numerous research in safe RL for single-agent RL. See a comprehensive review in [garcia2015comprehensive]. Nonetheless, the research on safe MARL is very scarce. For example, in [shalev2016safe], a safe RL algorithm is proposed for Multi-Agent Autonomous Driving. Therefore, another straightforward research direction for MARL would be Safe MARL in order to provide more applicable policies in this setup.

Heterogeneous MARL: Most of the works we studied above are homogeneous MARL, meaning that all the agents in the network are identical in terms of the ability and skill. However, in real-world applications, it is likely that we face a MARL problem where agents have different skills and abilities. Therefore, an additional problem here would be how different agents should utilize the other agents’ abilities to learn a more efficient policy. As a special case, consider the human-machine interaction. Particularly, humans are able to solve some RL problems very quickly using their experience and cognitive abilities. For example, in a 2D small space, they can find a very good approximation of the shortest path very quickly no matter how complex is the search space. On the other hand, machines have the ability to solve more complex problems in high-dimensional spaces. However, optimality comes at the cost of computational complexity so that oftentimes only a feasible solution is possible. The question that needs to be answered in this problem is the following: Is it possible to develop MARL algorithms that combine heterogeneous agents’ abilities toward maximizing the long-term gain? Moreover, can this be done in a principled way that comes with performance guarantees?

Optimization in MARL: Without doubt optimization is an indispensable part of the RL problems. Any progress in optimization methods may lead to more efficient RL and in turn MARL algorithms. In recent years, there has been a flurry of research on designing optimization algorithms for solving complex problems including nonconvex and nonsmooth optimization problems for multi-agent and distributed systems [bianchi2012convergence, di2016next, hong2017prox]. However, the literature of MARL still lacks those algorithms. Future research directions on MARL from the optimization perspective can be divided into two main branches: First, applying the existing optimization algorithms (or adapt them when necessary) on MARL problems. For instance, TRPO [schulman2015trust], which has been shown to be very efficient in single-agent RL problems, might be helpful for MARL problems as well. Second, focusing on the theory part of the algorithms. Despite the decent performance of the numerical methods, which utilize the neural networks in MARL, there exists a huge gap between such numerical performance and some kind of convergence analysis. Therefore, this might be the time to think out of the box and focus on the theory part of the neural networks too.