# Promoting Coordination through Policy Regularization in Multi-Agent Deep Reinforcement Learning

###### Abstract

In multi-agent reinforcement learning, discovering successful collective behaviors is challenging because it requires exploring a joint action space that grows exponentially with the number of agents. While the tractability of independent agent-wise exploration is appealing, this approach fails on tasks that require elaborate group strategies. We argue that coordinating the agents’ policies can guide their exploration, and we investigate techniques to promote such an inductive bias. We propose two policy regularization methods: TeamReg, which is based on inter-agent action predictability, and CoachReg, which relies on synchronized behavior selection. We evaluate each approach on four challenging continuous control tasks with sparse rewards that require varying levels of coordination. Our methodology allocates the same hyper-parameter search budget across our algorithms and baselines, and we find that our approaches are more robust to hyper-parameter variations. Our experiments show that our methods significantly improve performance on cooperative multi-agent problems and scale well when the number of agents is increased. Finally, we quantitatively analyze the effects of our proposed methods on the policies that our agents learn, and we show that our methods successfully enforce the qualities that we propose as proxies for coordinated behaviors.

## 1 Introduction

Multi-Agent Reinforcement Learning (MARL) refers to the task of training an agent to maximize its expected return by interacting with an environment that contains other learning agents. It represents a challenging branch of Reinforcement Learning (RL) with interesting developments in recent years (hernandez2018multiagent). A popular framework for MARL is the use of a Centralized Training and Decentralized Execution (CTDE) procedure (lowe2017multi; foerster2018counterfactual; iqbal2018actor; foerster2018bayesian; rashid2018qmix). Typically, one leverages centralized critics to approximate the value function of the aggregated observation-action pairs and trains actors that are each restricted to the observation of a single agent. Such critics, if exposed to coordinated joint actions leading to high returns, can steer the agents’ policies toward these highly rewarding behaviors. However, these approaches depend on the agents stumbling upon these collective actions by luck in order to grasp their benefit. Thus, they may fail in scenarios where such behaviors are unlikely to occur by chance. We hypothesize that in such scenarios, coordination-promoting inductive biases on the policy search could help discover successful behaviors more efficiently and supersede task-specific reward shaping and curriculum learning. We support this proposition by presenting a simple and scalable environment in which agents that learn in hand-crafted coordinated policy spaces succeed remarkably faster, notwithstanding the simplicity of the task.

For more realistic tasks in which coordinated strategies cannot be easily engineered and must be learned, we propose to transpose this insight by relying on two coordination proxies to bias the policy search. The first avenue, TeamReg, assumes that an agent must be able to predict the behavior of its teammates in order to coordinate with them. The second, CoachReg, supposes that coordinated agents collectively recognize different situations and synchronously switch to different sub-policies to react to them. In the following sections, we show how to derive practical policy regularizing terms from these premises and meticulously evaluate them.¹

¹ Source code for the algorithms and environments will be made public upon publication of this work.

Our contributions are threefold. Firstly, we show that coordination can crucially accelerate multi-agent learning for cooperative tasks. Secondly, we propose two novel approaches that aim at promoting such coordination through inductive biases in the policy search. Our methods augment CTDE MARL algorithms through additional multi-agent objectives that act as policy regularizers and are optimized jointly with the main return-maximization objective. Thirdly, we design two new sparse-reward cooperative tasks in the multi-agent particle environment (mordatch2018emergence). We use them along with two standard multi-agent tasks to present a detailed evaluation of our approaches against three different baselines. We validate our methods’ key components by performing an ablation study and a detailed analysis of their effect on agents’ behaviors. Our experiments suggest that our TeamReg objective provides a dense learning signal that helps to guide the policy towards coordination in the absence of external reward, eventually leading it to the discovery of high-performing team strategies in a number of cooperative tasks. Similarly, by enforcing synchronous sub-policy selections, CoachReg enables cooperating agents to concurrently fine-tune sub-behaviors for each collectively recognized situation, yielding significant improvements in overall performance.

## 2 Background

### 2.1 Markov Games

We consider the framework of Markov Games (littman1994markov), a multi-agent extension of Markov Decision Processes (MDPs). A Markov Game of $N$ independent agents is defined by the tuple $\langle \mathcal{S},\mathcal{T},\mathcal{P},\{\mathcal{O}^{i},\mathcal{A}^{i},\mathcal{R}^{i}\}_{i=1}^{N}\rangle$ where $\mathcal{S}$, $\mathcal{T}$, and $\mathcal{P}$ are respectively the set of all possible states, the transition function and the initial state distribution. While these are global properties of the environment, $\mathcal{O}^{i}$, $\mathcal{A}^{i}$ and $\mathcal{R}^{i}$ are individually defined for each agent $i$. They are respectively the observation functions, the sets of all possible actions and the reward functions. At each time-step $t$, the global state of the environment is given by $s_{t}\in \mathcal{S}$ and every agent’s individual action vector is denoted by $a_{t}^{i}\in \mathcal{A}^{i}$. To select its action, each agent $i$ only has access to its own observation vector $o_{t}^{i}$, which is extracted by the observation function $\mathcal{O}^{i}$ from the global state $s_{t}$. The initial state $s_{0}$ is sampled from the initial state distribution $\mathcal{P}:\mathcal{S}\to [0,1]$ and the next state $s_{t+1}$ is sampled from the probability distribution over the possible next states given by the transition function $\mathcal{T}:\mathcal{S}\times \mathcal{S}\times \mathcal{A}^{1}\times \dots \times \mathcal{A}^{N}\to [0,1]$. Finally, at each time-step, each agent receives an individual scalar reward $r_{t}^{i}$ from its reward function $\mathcal{R}^{i}:\mathcal{S}\times \mathcal{S}\times \mathcal{A}^{1}\times \dots \times \mathcal{A}^{N}\to \mathbb{R}$. Agents aim at maximizing their expected discounted return $\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}r_{t}^{i}\right]$ over the time horizon $T$, where $\gamma \in [0,1]$ is a discount factor.
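To make the objective concrete, the following is a minimal sketch of the discounted return $\sum_{t=0}^{T}\gamma^{t}r_{t}^{i}$ for a single sampled trajectory; the expectation in the text would be approximated by averaging this quantity over many trajectories. The function name `discounted_return`, the agent names and the reward values are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_{t=0}^{T} gamma**t * r_t for one agent's reward sequence."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical per-agent rewards for N = 2 agents over 3 time-steps.
rewards_per_agent = {
    "agent_1": [0.0, 0.0, 1.0],  # sparse: rewarded only at the last step
    "agent_2": [1.0, 1.0, 1.0],  # dense: rewarded at every step
}
returns = {
    name: discounted_return(rewards, gamma=0.5)
    for name, rewards in rewards_per_agent.items()
}
print(returns)  # {'agent_1': 0.25, 'agent_2': 1.75}
```

With $\gamma = 0.5$, the sparse reward at $t=2$ is scaled by $\gamma^{2} = 0.25$, which illustrates how late, sparse rewards contribute little to the return unless $\gamma$ is close to 1.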

### 2.2 Multi-Agent Deep Deterministic Policy Gradient

MADDPG (lowe2017multi) is an adaptation of the Deep Deterministic Policy Gradient algorithm (lillicrap2015continuous) to the multi-agent setting. It allows the training of cooperating and competing decentralized policies through the use of a centralized training procedure. In this framework, each agent $i$ possesses its own deterministic policy $\mu^{i}$ for action selection and critic $Q^{i}$ for state-action value estimation, which are respectively parametrized by $\theta^{i}$ and $\varphi^{i}$. All parametric models are trained off-policy from previous transitions $\zeta_{t} := (\mathbf{o}_{t},\mathbf{a}_{t},\mathbf{r}_{t},\mathbf{o}_{t+1})$ uniformly sampled from a replay buffer $\mathcal{D}$. Note that $\mathbf{o}_{t} := [o_{t}^{1},\dots ,o_{t}^{N}]$ is the joint observation vector and $\mathbf{a}_{t} := [a_{t}^{1},\dots ,a_{t}^{N}]$ is the joint action vector, obtained by concatenating the individual observation vectors $o_{t}^{i}$ and action vectors $a_{t}^{i}$ of all $N$ agents. Each centralized critic is trained to estimate the expected return for a particular agent $i$ from the Q-learning loss (watkins1992q):

$$\begin{aligned}\mathcal{L}^{i}(\varphi^{i}) &= \mathbb{E}_{\zeta_{t}\sim \mathcal{D}}\left[\frac{1}{2}\left(Q^{i}(\mathbf{o}_{t},\mathbf{a}_{t};\varphi^{i})-y_{t}^{i}\right)^{2}\right]\\ y_{t}^{i} &= r_{t}^{i}+\gamma\, Q^{i}(\mathbf{o}_{t+1},\mathbf{a}_{t+1};\overline{\varphi}^{i})\Big|_{a_{t+1}^{j}=\mu^{j}(o_{t+1}^{j};\overline{\theta}^{j})\;\forall j}\end{aligned}\tag{1}$$
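As an illustration of Equation (1), the sketch below computes the target $y_{t}^{i}$ and the per-sample squared error for one transition, together with the soft update that maintains the target weights $\overline{w}$. It is a toy under stated assumptions, not the authors' implementation: observations and actions are scalars rather than vectors, the critic and policies are plain Python callables instead of neural networks, and the names `td_target`, `critic_loss` and `soft_update` are hypothetical. In practice the expectation over $\mathcal{D}$ would be approximated by averaging the loss over a sampled minibatch.

```python
from typing import Callable, List, Sequence

Policy = Callable[[float], float]  # o^j -> a^j (scalars, for brevity)
Critic = Callable[[Sequence[float], Sequence[float]], float]  # (o, a) -> Q


def td_target(reward: float, gamma: float, next_joint_obs: Sequence[float],
              target_policies: Sequence[Policy], target_critic: Critic) -> float:
    """y = r + gamma * Q_target(o', a'), with a'^j = mu_target^j(o'^j) for all j."""
    # Each target policy only sees its own agent's next observation.
    next_joint_action = [mu(o) for mu, o in zip(target_policies, next_joint_obs)]
    return reward + gamma * target_critic(next_joint_obs, next_joint_action)


def critic_loss(q_value: float, target: float) -> float:
    """Per-sample loss 0.5 * (Q - y)^2; y is treated as a constant."""
    return 0.5 * (q_value - target) ** 2


def soft_update(target_weights: Sequence[float], online_weights: Sequence[float],
                tau: float) -> List[float]:
    """Soft target update: w_bar <- tau * w + (1 - tau) * w_bar."""
    return [tau * w + (1.0 - tau) * w_bar
            for w, w_bar in zip(online_weights, target_weights)]


# Toy two-agent example with made-up target policies and target critic.
target_policies = [lambda o: 0.5 * o, lambda o: -o]
target_critic = lambda obs, act: sum(obs) + sum(act)

y = td_target(reward=1.0, gamma=0.5, next_joint_obs=[1.0, 2.0],
              target_policies=target_policies, target_critic=target_critic)
loss = critic_loss(q_value=2.0, target=y)
print(y, loss)  # 1.75 0.03125
```

Here the next joint action is $[0.5, -2.0]$, so the target critic outputs $3.0 - 1.5 = 1.5$ and $y = 1.0 + 0.5 \times 1.5 = 1.75$. Note that the critic receives the joint observation and joint action, whereas each policy receives only its own observation, which mirrors the centralized-training, decentralized-execution split.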

For a given set of weights $w$, we define its target counterpart $\overline{w}$, which is updated as $\overline{w}\leftarrow \tau w+(1-\tau )\overline{w}$, where $\tau$ is a hyper-parameter. Each policy is updated to maximize the expected discounted return of the corresponding agent $i$: