### Abstract

A central challenge in multi-agent reinforcement learning is the induction of coordination between agents of a team. In this work, we investigate how to promote inter-agent coordination and discuss two possible avenues based respectively on inter-agent modelling and guided synchronized sub-policies. We test each approach in four challenging continuous control tasks with sparse rewards and compare them against three variants of MADDPG, a state-of-the-art multi-agent reinforcement learning algorithm. To ensure a fair comparison, we rely on a thorough hyper-parameter selection and training methodology that allows a fixed hyper-parameter search budget for each algorithm and environment. We consequently assess the hyper-parameter sensitivity, sample-efficiency and asymptotic performance of each learning method. Our experiments show that our proposed algorithms are more robust to the hyper-parameter choice and reliably lead to strong results.

# Promoting Coordination through Policy Regularization in Multi-Agent Reinforcement Learning


Paul Barde^{*} (^{*}Equal contribution)
Quebec AI institute (Mila)
McGill University
[email protected]
Julien Roy^{*}
Quebec AI institute (Mila)
Polytechnique Montréal
[email protected]
Félix G. Harvey
Quebec AI institute (Mila)
Polytechnique Montréal
Derek Nowrouzezahrai
Quebec AI institute (Mila)
McGill University
Christopher Pal
Quebec AI institute (Mila)
Polytechnique Montréal

Preprint. Under review.

## 1 Introduction

Multi-agent Reinforcement Learning (MARL) refers to the task of training an agent to maximize its expected return by interacting with an environment that contains other learning agents. It represents a challenging branch of reinforcement learning with interesting developments in recent years [jaderberg2018human, lowe2017multi, foerster2018counterfactual, hernandez2018multiagent].

A popular framework for MARL is the use of a centralized training procedure and decentralized execution [lowe2017multi, foerster2018counterfactual, iqbal2018actor, foerster2018bayesian, rashid2018qmix]. This is typically done by training critics that approximate the value of the joint observations and actions, and using these critics to train actors that are restricted to the observation-action pair of a single agent. Such critics, if exposed to coordinated joint actions leading to high returns, can steer the agents’ policies toward these highly rewarding behaviors. However, these approaches depend on the agents luckily stumbling on coordinated joint actions in order to grasp their benefit. Thus, they might fail in scenarios where coordination is unlikely to occur by chance. We hypothesize that in such scenarios, coordination-promoting inductive biases on the policy search could help discover coordinated behaviors more efficiently and supersede task-related reward shaping and curriculum learning.

In this work, we explore two different priors to successful coordination and use these to regularize the learned policies. The first avenue (TeamMADDPG) supposes that an agent must be able to predict the behavior of its teammates in order to coordinate with them. The second (CoachMADDPG) assumes that coordinating agents individually recognize different modes or situations, and synchronously use different sub-policies (or behaviors) to deal with these distinct events. In the following sections we show how to derive practical regularization terms from these premises and meticulously evaluate them.^{1}

^{1} Source code for the algorithms and environments will be made public upon publication of this work.

Our contributions are twofold. First, we propose two novel algorithms, TeamMADDPG and CoachMADDPG, that aim at promoting coordination in multi-agent systems. Our approaches build on the widely used MADDPG algorithm and augment it with additional multi-agent objectives that act as regularizers and are optimized jointly with the main return-maximisation objective. Second, we design four sparse-reward collaboration tasks in the multi-agent particle environment [mordatch2018emergence] and present a detailed evaluation of these algorithms against three baseline variations. We further analyze the effect of the induced behavioral bias through an ablation study. It suggests that our team-spirit objective provides a dense learning signal that helps guide the policy towards coordination in the absence of external reward, and effectively leads it to the discovery of high-performing team strategies in a number of cooperative tasks.

## 2 Background

### 2.1 Markov Games

In this work we consider the framework of Markov Games [littman1994markov], a multi-agent extension of Markov Decision Processes (MDPs). A Markov Game involves $N$ independent agents and is defined by the tuple $\langle \mathcal{S},\mathcal{T},\mathcal{P},\{\mathcal{O}^{i},\mathcal{A}^{i},\mathcal{R}^{i}\}_{i=1}^{N}\rangle$. $\mathcal{S}$, $\mathcal{T}$, and $\mathcal{P}$ are respectively the set of all possible states, the transition function and the initial state distribution. While these are global properties of the environment, $\mathcal{O}^{i}$, $\mathcal{A}^{i}$ and $\mathcal{R}^{i}$ are individually defined for each agent $i$. They are respectively the observation functions, the sets of all possible actions and the reward functions. At each time-step $t$, the global state of the environment is given by $s_{t}\in \mathcal{S}$ and every agent’s individual action vector is denoted by $a_{t}^{i}\in \mathcal{A}^{i}$. To select its action, each agent $i$ has access only to its own observation vector $o_{t}^{i}$, which is extracted by its observation function $\mathcal{O}^{i}$ from the global state $s_{t}$. The initial global state $s_{1}$ is sampled from the initial state distribution $\mathcal{P}:\mathcal{S}\mapsto [0,1]$ and the next states of the environment $s_{t+1}$ are sampled from the probability distribution over the possible next states $P(\mathcal{S})$ given by the transition function $\mathcal{T}:\mathcal{S}\times \mathcal{A}^{1}\times \dots \times \mathcal{A}^{N}\mapsto P(\mathcal{S})$. Finally, at each time-step each agent receives an individual scalar reward $r_{t}^{i}$ from its reward function $\mathcal{R}^{i}:\mathcal{S}\times \mathcal{A}^{1}\times \dots \times \mathcal{A}^{N}\mapsto \mathbb{R}$.
Agents aim at maximizing their expected discounted return $\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}r_{t}^{i}\right]$ over the time horizon $T$, where $\gamma \in [0,1]$ is a discount factor.
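As a small illustration of this objective, the discounted return of a single agent can be computed from its per-step rewards. The function name `discounted_return` and the example reward sequence are illustrative assumptions, not part of the original work; the sum starts at $t=0$ as in the expression above.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], with t starting at 0."""
    total = 0.0
    # Iterate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical sparse-reward episode: a reward only at the last of four steps.
example_rewards = [0.0, 0.0, 0.0, 1.0]
example_return = discounted_return(example_rewards, gamma=0.5)
```

With sparse rewards, most terms of the sum are zero, so the return is dominated by how early the few rewarding steps occur.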

### 2.2 Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

MADDPG [lowe2017multi] is an adaptation of the Deep Deterministic Policy Gradient algorithm (DDPG) [lillicrap2015continuous] to the multi-agent setting. It allows the training of cooperating and competing decentralized policies through the use of a centralized training procedure. In this framework, each agent $i$ possesses its own deterministic policy ${\mu}^{i}$ for action selection and critic ${Q}^{i}$ for state-action value estimation, which are respectively parametrized by ${\theta}_{i}$ and ${\varphi}_{i}$. All parametric models are trained off-policy from previous transitions $({\mathbf{o}}_{t},{\mathbf{a}}_{t},{\mathbf{r}}_{t},{\mathbf{o}}_{t+1})$ uniformly sampled from a replay buffer $\mathcal{D}$. Note that ${\mathbf{o}}_{t}=[{o}_{t}^{1},\dots,{o}_{t}^{N}]$ is the joint observation vector and ${\mathbf{a}}_{t}=[{a}_{t}^{1},\dots,{a}_{t}^{N}]$ is the joint action vector, obtained by concatenating the individual observation vectors ${o}_{t}^{i}$ and action vectors ${a}_{t}^{i}$ of all $N$ agents. Each centralized critic is trained to estimate the expected return for a particular agent $i$ using the Deep Q-Network (DQN) [mnih2015human] loss:

$$\mathcal{L}^{i}(\varphi_{i})=\mathbb{E}_{\mathbf{o}_{t},\mathbf{a}_{t},\mathbf{o}_{t+1},r_{t}^{i}\sim \mathcal{D}}\left[\frac{1}{2}\left(Q^{i}(\mathbf{o}_{t},\mathbf{a}_{t};\varphi_{i})-\left(r_{t}^{i}+\gamma Q_{\text{target}}^{i}(\mathbf{o}_{t+1},\mathbf{a}_{t+1})\right)\right)^{2}\Big|_{a_{t+1}^{j}=\mu^{j}(o_{t+1}^{j})}\right] \tag{1}$$

Each policy is updated to maximize the expected discounted return of the corresponding agent $i$:

$$J_{PG}^{i}(\theta_{i})=\mathbb{E}_{\mathbf{o}_{t},\mathbf{a}_{t}\sim \mathcal{D}}\left[Q^{i}(\mathbf{o}_{t},\mathbf{a}_{t})\Big|_{a_{t}^{i}=\mu^{i}(o_{t}^{i};\theta_{i})}\right] \tag{2}$$

yielding the following policy gradient:

$$\nabla_{\theta_{i}}J_{PG}^{i}(\theta_{i})=\mathbb{E}_{\mathbf{o}_{t},\mathbf{a}_{t}\sim \mathcal{D}}\left[\nabla_{\theta_{i}}\mu^{i}(o_{t}^{i};\theta_{i})\nabla_{a_{t}^{i}}Q^{i}(\mathbf{o}_{t},\mathbf{a}_{t})\Big|_{a_{t}^{i}=\mu^{i}(o_{t}^{i};\theta_{i})}\right] \tag{3}$$

By guiding each agent’s policy toward states that have been more positively evaluated when taking into account all agents’ observation-action pairs, the centralized training of the value-function restores stationarity in the multi-agent setting. In addition, this mechanism can allow agents to implicitly learn coordinated strategies that can then be deployed in a decentralized way. However, this procedure does not encourage the discovery of coordinated strategies since high-reward behaviors have to be randomly experienced through unguided exploration.
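The centralized critic loss of Equation (1) can be sketched on a minibatch as follows. This is a minimal illustration under stated assumptions: the critics and policies are plain Python callables, observations and actions are scalars, and the names `critic_loss`, `toy_q` and `toy_policies` are hypothetical. A real implementation would use neural networks and automatic differentiation.

```python
def critic_loss(batch, q, q_target, policies, gamma):
    """Monte-Carlo estimate of Eq. (1) for one agent over a minibatch.

    batch: list of (joint_obs, joint_act, reward_i, next_joint_obs) tuples,
           where joint_obs and next_joint_obs hold one observation per agent.
    q, q_target: callables (joint_obs, joint_act) -> float.
    policies: one callable per agent, mapping its own observation to an action.
    """
    total = 0.0
    for joint_obs, joint_act, reward_i, next_joint_obs in batch:
        # Each agent j picks its next action from its own next observation only.
        next_joint_act = [mu_j(o_j) for mu_j, o_j in zip(policies, next_joint_obs)]
        td_target = reward_i + gamma * q_target(next_joint_obs, next_joint_act)
        total += 0.5 * (q(joint_obs, joint_act) - td_target) ** 2
    return total / len(batch)


# Toy two-agent example with scalar observations and actions (all hypothetical).
def toy_q(joint_obs, joint_act):
    return sum(joint_obs) + sum(joint_act)


toy_policies = [lambda o: -o, lambda o: 2.0 * o]
toy_batch = [([1.0, 2.0], [0.5, 0.5], 1.0, [1.0, 1.0])]
toy_loss = critic_loss(toy_batch, toy_q, toy_q, toy_policies, gamma=0.95)
```

The critic sees every agent's observation and action, whereas each policy in `toy_policies` only receives its own observation, which is what allows decentralized execution.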

## 3 Related Work

Many works in multi-agent reinforcement learning consider explicit communication channels between the agents and distinguish between communicative actions (broadcasting a given message) and physical actions (moving in a given direction) [foerster2016learning, mordatch2018emergence, lazaridou2016multi]. Consequently, they often focus on the emergence of language, considering tasks where the agents must discover a common communication protocol in order to succeed. Deriving a successful communication protocol can already be seen as coordination in the communicative action space and can enable, to some extent, successful coordination in the physical action space [ahilan2019feudal]. However, explicit communication is not a necessary condition for coordination as agents can rely on physical communication [mordatch2018emergence, gupta2017cooperative].

Approaches to shape RL agents’ behaviors with respect to other agents have also been explored. Strouse et al. [strouse2018learning] use the mutual information between the agent’s policy and a goal-independent policy to shape the agent’s behaviour towards hiding or spelling out its current goal. However, this approach is only applicable for tasks with an explicit goal representation and is not specifically intended for coordination. Jaques et al. [jaques2018intrinsic] use the direct causal effect between agents’ actions as an intrinsic reward to encourage social empowerment, but rely on giving to some agents (the influencees) the action of other agents (the influencers) as input to their policy, which violates the decentralized execution framework.

Finally, Barton et al. [barton2018measuring] propose CCM (convergent cross mapping) as a metric for measuring the degree of effective coordination between two agents. However, while this may represent an interesting avenue for behavior analysis, it does not provide any tool for actually enforcing coordination: CCM must be computed over long time series, which makes it impractical as a learning signal for single-step temporal difference based methods. In this work, we design two coordination-driven multi-agent algorithms that do not rely on the existence of explicit communication channels and allow the learned coordinated behaviors to carry over to test time, when all agents act in a decentralized fashion.

## 4 Policy regularization

Our two proposed algorithms use team objectives as regularizers alongside the common policy gradient update. Pseudocode for both implementations is provided in Appendix 8.2.

### 4.1 Team regularization

We hypothesize that coordination can be defined as the degree of predictability of one agent’s behavior with respect to its teammate(s). In other words, in a coordinated team, there should be some predictable structure between the agents’ actions. We cast this assumption into the decentralized framework by training agents to predict their teammates’ actions given only their own observation, which can be simply implemented by adding additional heads to each agent’s policy network. We define this regularization term as the Mean Squared Error (MSE) between the predicted and real actions of the teammates, yielding a teammate-modelling secondary objective similar to the models of other agents used in [jaques2018intrinsic, hern2019agent] and often referred to as agent modelling [schadd2007opponent]. Importantly, most previous work such as [hern2019agent] exclusively uses this approach to improve the richness of the learned representations. In our case, the same objective is also used to drive the teammates’ behaviors closer to the prediction, a process that requires a differentiable action selection mechanism. We call this the Team-Spirit regularization term ${J}_{TS}^{i,j}$ between agents $i$ and $j$:

$$J_{TS}^{i,j}(\theta_{i},\theta_{j})=\text{MSE}(\widehat{a}_{t}^{i,j},a_{t}^{j})=\frac{1}{2}\sum_{k=1}^{|\mathcal{A}^{j}|}\left(\widehat{a}_{t,k}^{i,j}-a_{t,k}^{j}\right)^{2} \tag{4}$$

The total gradient for a given agent $i$ becomes:

$$\nabla_{\theta_{i}}J_{total}^{i}(\theta_{i})=\nabla_{\theta_{i}}J_{PG}^{i}(\theta_{i})+\lambda_{1}\sum_{j}\nabla_{\theta_{i}}J_{TS}^{i,j}(\theta_{i},\theta_{j})+\lambda_{2}\sum_{j}\nabla_{\theta_{i}}J_{TS}^{j,i}(\theta_{j},\theta_{i}) \tag{5}$$

where ${\lambda}_{1}$ and ${\lambda}_{2}$ are hyper-parameters that weigh respectively how well an agent should predict its teammates’ actions, and how predictable an agent should be for its teammates. We call this modified algorithm TeamMADDPG. The diagram of Figure 1a summarizes these interactions.
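To make the two Team-Spirit terms concrete, the sketch below evaluates Equation (4) for every pair involving a given agent and separates the term weighted by $\lambda_1$ (how well agent $i$ predicts its teammates) from the term weighted by $\lambda_2$ (how predictable agent $i$ is). The nested-list layout `predictions[i][j]` and the function names are assumptions made for illustration. The sketch also treats both terms as errors to be reduced, which is an assumption about the sign convention.

```python
def team_spirit_loss(predicted_action, actual_action):
    """Eq. (4): half the squared error between a predicted and an actual action."""
    assert len(predicted_action) == len(actual_action)
    return 0.5 * sum((p - a) ** 2 for p, a in zip(predicted_action, actual_action))


def team_spirit_terms(predictions, actions, i):
    """Return the two regularizer sums that involve agent i.

    predictions[i][j] is agent i's prediction of agent j's action
    (hypothetical layout; the diagonal entries are unused).
    """
    others = [j for j in range(len(actions)) if j != i]
    # Weighted by lambda_1: how wrong agent i is about its teammates.
    predicting = sum(team_spirit_loss(predictions[i][j], actions[j]) for j in others)
    # Weighted by lambda_2: how wrong the teammates are about agent i.
    predictable = sum(team_spirit_loss(predictions[j][i], actions[i]) for j in others)
    return predicting, predictable


# Two agents with 2-D actions (values chosen only for illustration).
actions = [[1.0, 0.0], [0.0, 1.0]]
predictions = [[None, [0.0, 0.0]], [[1.0, 1.0], None]]
predicting_0, predictable_0 = team_spirit_terms(predictions, actions, i=0)
```

In a differentiable implementation, the second term sends gradients into agent $i$'s own action, which is what pushes its behavior toward what teammates expect.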

### 4.2 Coach regularization

This regularization choice builds on the assumption that coordination could be more easily achieved if agents had the same representation of the current situation and could use this representation to synchronously select a sub-behavior. To promote this, we introduce the coach, a new entity parametrized by $\psi $, which learns to produce from the joint observations of all agents a discrete embedding (behavior-mask) ${e}_{t}^{c}$ that is passed to the agents at training time. This mask is a one-hot vector of size $k$ drawn from a multinomial distribution ${p}^{c}({e}_{t}^{c}|{\mathbf{o}}_{t};\psi )$. In practice, the policy network ${\mu}^{i}$ of each agent $i$ is modified as follows: a first linear pre-hidden layer is split into two heads. The first one consists of the pre-activations of the hidden layer ${h}_{1}$. The second head, of dimension $k$, consists of the pre-activations of a softmax function yielding the distribution ${p}^{i}({e}_{t}^{i}|{o}_{t}^{i};{\theta}_{i})$ from which a behavior mask is sampled. The mask, coming either from the coach or from the agent, is then tiled along the dimensions of the first head of the linear layer and applied via element-wise multiplication, a process similar to dropout [srivastava2014dropout] but in a deterministic, structured fashion. The output forms ${h}_{1}$, which therefore consists of the masked pre-activations of the first linear head.
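The masking step can be sketched as follows. The text does not specify the tiling layout, so this example assumes that the hidden size is a multiple of $k$ and that the whole mask is repeated end to end; the function name `apply_behavior_mask` is hypothetical.

```python
def apply_behavior_mask(pre_activations, mask):
    """Tile a one-hot mask of size k over H pre-activations and multiply element-wise.

    Assumes H is a multiple of k and that tiling repeats the whole mask
    H // k times; the original layout may differ.
    """
    k, h = len(mask), len(pre_activations)
    if h % k != 0:
        raise ValueError("hidden size must be a multiple of the mask size")
    tiled = list(mask) * (h // k)
    return [x * m for x, m in zip(pre_activations, tiled)]


# Six pre-activations masked by a one-hot vector of size k = 3.
masked = apply_behavior_mask([0.3, -1.2, 0.8, 2.0, -0.5, 1.1], mask=[0, 1, 0])
```

Under this layout, each value of the mask keeps a different, fixed subset of hidden units active, so each behavior corresponds to its own sub-network of the policy.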

In other words, the policy of agent $i$ is conditioned via embedding-masking and ${a}_{t}^{i}=\mu ({o}_{t}^{i},{e}_{t};{\theta}_{i})$, where ${e}_{t}$, the behavior-mask at time $t$ can either be the agent’s private mask ${e}_{t}^{i}$ or the coach’s mask ${e}_{t}^{c}$. To allow this, the coach is trained to output masks that (1) yield high returns when passed to the agents and (2) are predictable by the agents. Similarly, each agent is regularized so that (1) its private mask matches the coach’s mask and (2) it derives efficient behavior from it. This scheme enforces coordinated sub-policies in agents, while having agents predict the best behavior that the coach would impose given all observations. The policy gradient loss when agent $i$ is provided with the coach’s plan is given by:

$$J_{EPG}^{i}(\theta_{i},\psi)=\mathbb{E}_{\mathbf{o}_{t},\mathbf{a}_{t}\sim \mathcal{D}}\left[Q^{i}(\mathbf{o}_{t},\mathbf{a}_{t})\Big|_{a_{t}^{i}=\mu(o_{t}^{i},e_{t};\theta_{i}),\;e_{t}\sim p^{c}(\cdot|\mathbf{o}_{t};\psi)}\right] \tag{6}$$

The difference between the plan of agent $i$ and the coach’s plan is measured by the Kullback–Leibler divergence:

$$J_{E}^{i}(\theta_{i},\psi)=\mathbb{E}_{\mathbf{o}_{t}\sim \mathcal{D}}\left[\text{D}_{\text{KL}}\left(p^{c}(\cdot|\mathbf{o}_{t};\psi)\,\big\|\,p^{i}(\cdot|o_{t}^{i};\theta_{i})\right)\right] \tag{7}$$
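For a single observation, the divergence inside Equation (7) compares two categorical distributions over the $k$ behavior masks. The sketch below computes it directly from probability lists; the function name, the `eps` smoothing constant and the example probabilities are assumptions for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two categorical distributions given as probability lists."""
    # Terms with p_i = 0 contribute nothing; eps guards against log(0) in q.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical mask distributions for k = 2 behaviors.
coach_probs = [0.5, 0.5]
agent_probs = [0.25, 0.75]
plan_mismatch = kl_divergence(coach_probs, agent_probs)
```

The divergence is zero only when the agent assigns the same probabilities as the coach, and it is not symmetric in its two arguments.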

The total gradient for agent $i$ is:

$$\nabla_{\theta_{i}}J_{total}^{i}(\theta_{i},\psi)=\nabla_{\theta_{i}}J_{PG}^{i}(\theta_{i})+\lambda_{1}\nabla_{\theta_{i}}J_{E}^{i}(\theta_{i},\psi)+\lambda_{2}\nabla_{\theta_{i}}J_{EPG}^{i}(\theta_{i},\psi) \tag{8}$$

where:

$$\nabla_{\theta_{i}}J_{EPG}^{i}(\theta_{i},\psi)=\mathbb{E}_{\mathbf{o}_{t},\mathbf{a}_{t}\sim \mathcal{D}}\left[\nabla_{\theta_{i}}\mu(o_{t}^{i},e_{t};\theta_{i})\nabla_{a_{t}^{i}}Q^{i}(\mathbf{o}_{t},\mathbf{a}_{t})\Big|_{a_{t}^{i}=\mu(o_{t}^{i},e_{t};\theta_{i}),\;e_{t}\sim p^{c}(\cdot|\mathbf{o}_{t};\psi)}\right] \tag{9}$$

with ${\lambda}_{1}$ and ${\lambda}_{2}$ the regularization coefficients. Similarly for the coach:

$$\nabla_{\psi}J_{total}^{c}(\psi)=\frac{1}{N}\sum_{i=1}^{N}\left(\nabla_{\psi}J_{EPG}^{i}(\theta_{i},\psi)+\nabla_{\psi}J_{E}^{i}(\theta_{i},\psi)\right) \tag{10}$$

In order to propagate gradients through the sampled plan, we reparametrize the multinomial distribution using the Gumbel-softmax trick [jang2016categorical]. We call this modified algorithm CoachMADDPG. The diagram of Figure 1b summarizes these interactions.
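The Gumbel-softmax reparametrization can be sketched in plain Python as follows. This shows only the sampling step: Gumbel noise is added to the logits and a temperature-scaled softmax yields a relaxed one-hot vector. The temperature value, the logits and the final discretization into `hard_mask` are illustrative assumptions, and no gradient machinery is shown.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed one-hot sample from the categorical distribution softmax(logits)."""
    # Gumbel(0, 1) noise via inverse transform sampling: g = -log(-log(u)).
    gumbels = [-math.log(-math.log(rng.uniform(1e-10, 1.0 - 1e-10))) for _ in logits]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    # Numerically stable softmax over the perturbed scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


rng = random.Random(0)
relaxed_mask = gumbel_softmax_sample([1.0, 2.0, 0.5], temperature=0.5, rng=rng)

# Discretize to a one-hot behavior mask by taking the largest entry.
best = max(range(len(relaxed_mask)), key=relaxed_mask.__getitem__)
hard_mask = [1.0 if idx == best else 0.0 for idx in range(len(relaxed_mask))]
```

Lower temperatures push `relaxed_mask` closer to a one-hot vector, at the cost of higher-variance gradients in a differentiable implementation.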

## 5 Experimental setup

### 5.1 Training environments

All of our environments are based on a modified version of the OpenAI multi-agent particle environments [mordatch2018emergence]. While non-zero return can be achieved in these tasks by selfish agents, they all benefit from coordinated strategies and optimal return can only be achieved by agents working closely together. Figure 2 presents visualizations and a brief description of all four environments. A more detailed description is provided in Appendix 8.1. In all environments except Imitation (see Figure 2-d), agents receive as observations their own global position and velocity as well as the relative position of other agents and landmarks. Importantly, while most work showcasing experiments with this engine uses discrete action spaces at least part of the time and usually learns on dense rewards (e.g. the inverse of the distance to the objective) [iqbal2018actor, lowe2017multi, jiang2018learning], in all of our experiments, agents learn on continuous action spaces and from sparse rewards only.

### 5.2 Hyper-parameter tuning and training details

To offer a fair comparison between all methods, the hyper-parameter search routine is the same for each algorithm and environment. For each search-experiment (one per algorithm per environment), 50 randomly sampled hyper-parameter configurations are used to train the models for $15,000$ steps on 3 different training seeds. During training, the policies are evaluated on a fixed set of 10 different episodes every 100 learning steps. To assess the performance of a learned policy, its best model iteration is evaluated on 100 different episodes. The performance of a hyper-parameter configuration is defined as the performance of the corresponding learned policies averaged across training seeds.
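The search routine described above can be outlined as a simple random search in which each configuration is scored by its mean performance across seeds. In this sketch, `sample_config` and `train_and_evaluate` are toy stand-ins: the search range, the single searched parameter and the scoring function are all hypothetical and do not reproduce the ranges of Table 1.

```python
import math
import random


def random_search(sample_config, train_and_evaluate, n_configs=50, seeds=(0, 1, 2), rng=None):
    """Score each sampled configuration by its mean performance across training seeds."""
    rng = rng or random.Random(0)
    results = []
    for _ in range(n_configs):
        config = sample_config(rng)
        scores = [train_and_evaluate(config, seed) for seed in seeds]
        results.append((sum(scores) / len(scores), config))
    # Best configuration first.
    results.sort(key=lambda item: item[0], reverse=True)
    return results


# Toy stand-ins (hypothetical): only the actor learning rate is searched, and
# the "performance" peaks at 1e-3 with a small seed-dependent offset.
def sample_config(rng):
    return {"actor_lr": 10 ** rng.uniform(-5.0, -2.0)}


def train_and_evaluate(config, seed):
    return -(math.log10(config["actor_lr"]) + 3.0) ** 2 + 0.01 * seed


ranked = random_search(sample_config, train_and_evaluate)
best_score, best_config = ranked[0]
```

Keeping the full ranked list, rather than only the best entry, is what makes it possible to study the spread of performance across configurations.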

We perform searches over the following hyper-parameters: the learning rate of the actor ${\alpha}_{\theta}$, the learning rate of the critic ${\omega}_{\varphi}$ relative to the actor (${\alpha}_{\varphi}={\omega}_{\varphi}*{\alpha}_{\theta}$), the target-network soft-update parameter $\tau $ and the initial scale of the exploration noise ${\eta}_{noise}$ for the Ornstein-Uhlenbeck noise generating process [uhlenbeck1930theory] (such as in [lillicrap2015continuous]). In the case of TeamMADDPG and CoachMADDPG, we additionally search over the regularization weights ${\lambda}_{1}$ and ${\lambda}_{2}$. The learning rate of the coach is always equal to the actor’s learning rate (i.e. ${\alpha}_{\theta}={\alpha}_{\psi}$), motivated by their similar architectures and learning signals and in order to reduce the search space. Table 1 in Appendix 8.4 shows the tunable hyper-parameters and their search ranges.

In all of our experiments, we use the Adam optimizer [kingma2014adam] to perform parameter updates. All models (actors and critics) are parametrized by MLPs containing two hidden layers of $128$ units each. We use the Rectified Linear Unit (ReLU) [nair2010rectified] as activation function and layer normalization [ba2016layer] to stabilize the learning. We use a buffer-size of ${10}^{6}$ entries and a batch-size of $1024$. We collect $100$ transitions by interacting with the environment for each learning update. For all tasks in our hyper-parameter searches, we train the agents for $15,000$ episodes of $100$ steps and then re-train the best configuration for each (algorithm, environment) pair for twice as long ($30,000$ episodes) to ensure full convergence for the final evaluation. The scale of the exploration noise is kept constant for the first half of the training time and then decreases linearly to $0$ until the end of training. We use a reward discount factor $\gamma $ of $0.95$ and a gradient clipping threshold of $0.5$ in all experiments.
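The exploration-noise schedule described above (constant for the first half of training, then a linear decay to $0$) can be written as a small function. The name `noise_scale` and the example values are illustrative assumptions.

```python
def noise_scale(step, total_steps, initial_scale):
    """Constant for the first half of training, then linear decay to 0 at the end."""
    half = total_steps / 2
    if step <= half:
        return initial_scale
    remaining = (total_steps - step) / (total_steps - half)
    return initial_scale * max(0.0, remaining)


# Hypothetical run of 30,000 episodes with an initial noise scale of 0.4.
scale_start = noise_scale(0, 30000, 0.4)
scale_three_quarters = noise_scale(22500, 30000, 0.4)
scale_end = noise_scale(30000, 30000, 0.4)
```

The returned value would multiply the output of the Ornstein-Uhlenbeck process before it is added to the deterministic action.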

## 6 Results and Discussion

We compare our algorithms (TeamMADDPG and CoachMADDPG) against MADDPG, DDPG and SharedMADDPG on the four cooperative multi-agent tasks described in Section 5.1. DDPG is the single-agent counterpart of MADDPG (decentralized training) whereas SharedMADDPG shares the policy and value-function models across agents. We report in Figure 3 the results of our per-algorithm hyper-parameter tuning as described in Section 5.2.

### 6.1 Robustness to hyper-parameters

Stability across hyper-parameter configurations is a recurring challenge in deep reinforcement learning. The average performance (across seeds) for each randomly sampled hyper-parameter configuration allows us to empirically evaluate the robustness of an algorithm with respect to its hyper-parameters. Figure 3 shows that while multiple algorithms exhibit good maximum performance after $15,000$ training steps on several environments, our proposed coordination-regularizing strategies can boost robustness to hyper-parameter variations in many environments, which can be of great value with more limited search budgets.

### 6.2 Final performances

We retrain all algorithms with their most successful configuration for each environment, this time on 10 different seeds, and report the average learning curves in Figure 4. TeamMADDPG and CoachMADDPG outperform all other algorithms by large margins in Spread and Bounce, both in terms of sample efficiency and asymptotic performance. All algorithms perform competitively in Chase, with SharedMADDPG slightly above. However, SharedMADDPG fails to learn the task in the three other environments (Spread, Bounce and Imitation). Finally, CoachMADDPG and DDPG obtain the best results in Imitation, with TeamMADDPG and MADDPG slightly lower.

Lowe et al. [lowe2017multi] suggest that DDPG’s lower performance can be caused by the absence of centralized critics, which makes the environment appear non-stationary from any individual agent’s perspective. This could also explain why DDPG performs best on the most asymmetric task, Imitation, where only one agent is influenced by the other. The trailing results of SharedMADDPG might be explained by the fact that sharing parameters across agents prevents specialization, a useful asset even in homogeneous cooperative tasks. Finally, TeamMADDPG and CoachMADDPG perform competitively with or better than MADDPG on all environments, suggesting that learning team-related secondary tasks is a valuable way to foster efficient coordinated strategies, resulting in faster learning and better performance.

### 6.3 The effects of being predictable

We aim to analyze here the effects of the regularizers of TeamMADDPG. Specifically, we ask if the regularizer weight ${\lambda}_{2}$ that pushes agents to be predictable by others is valuable. To this end, we compare one run of the best performing hyper-parameter configuration for TeamMADDPG on the Spread environment with two ablated versions. For the first ablation, we keep everything else fixed and retrain this model while setting ${\lambda}_{2}$ to zero. For the second ablation, we set both regularizers’ weights ${\lambda}_{1}$ and ${\lambda}_{2}$ to zero. The former variant corresponds to MADDPG only augmented by the side task of predicting others’ actions whereas the latter is exactly equivalent to the MADDPG algorithm, where agents neither try to predict others’ actions, nor to be predictable. The expected return and Team-Spirit loss defined in Section 4.1 averaged over the three agents of the environment are presented in Figure 5 for these three experiments.

At the beginning of training, due to the weight initialization, the output and predicted actions from agents both have relatively small norms, explaining the small Team-Spirit loss. As training goes on, the norms of the action vectors quickly increase and the regularization loss becomes more important. As expected, the non-regularized baseline (${\lambda}_{1}$ OFF $|$ ${\lambda}_{2}$ OFF) has a high Team-Spirit loss as it is in no way encouraged to predict the actions of other agents correctly. When reintegrating the agent-modelling objective (${\lambda}_{1}$ ON $|$ ${\lambda}_{2}$ OFF), the agents significantly reduce the Team-Spirit loss, but it never reaches values as low as for the full TeamMADDPG as we removed the incentive for agents to be predictable. Interestingly, this setting also performs slightly worse than the baseline overall on the task. We hypothesize that this is due to the fact that the regularizing task becomes too hard in this case, and that the learning signal from this auxiliary loss becomes too important in the learning dynamics, hindering performance. Finally, when also pushing agents to be predictable (${\lambda}_{1}$ ON $|$ ${\lambda}_{2}$ ON), the agents best predict others’ actions and performance is also improved. We also notice that the Team-Spirit loss increases when performance starts to increase, i.e. when agents start to master the task. Once the reward maximisation signal becomes stronger, the relative importance of the second task is reduced. We hypothesize that being predictable with respect to one another may push agents to explore in a more structured and informed manner in the absence of reward signal, as similarly pursued by intrinsic motivation approaches [chentanez2005intrinsically].

## 7 Conclusion

In this work we introduced two policy regularization methods to promote multi-agent coordination. One is based on inter-agent action predictability and is called TeamMADDPG. The other one, CoachMADDPG, is based on synchronized and corresponding behavior selection with the help of an auxiliary coach network present only during training. We performed fair and transparent hyper-parameter searches for each compared algorithm and environment and found that our regularizing strategies have a positive effect on robustness to hyper-parameter selection and asymptotic performance, empirically showing the benefits of such coordination-promoting regularization in the cooperative multi-agent setting. Our techniques work with sparse rewards and continuous action spaces, and do not break the decentralized execution paradigm. One downside of our methods is that they are restricted in their current form to metrics that only account for the current timestep, a restriction that simplifies off-policy learning but also greatly impairs the longer-term coordination possibilities. In future work we aim to explore model-based planning approaches for multi-agent coordination.

## Acknowledgement

We wish to thank Olivier Delalleau for providing insightful comments on previous versions of these methods as well as Fonds de Recherche Nature et Technologies (FRQNT), Ubisoft Montréal and Mitacs for providing part of the funding for this work.

## 8 Appendix

### 8.1 Task descriptions

Spread (Figure 2a): In this environment, there are 3 agents (small orange circles) and 3 landmarks (bigger gray circles). At every timestep, agents receive a team-reward ${r}_{t}=n$ where $n$ is the number of landmarks occupied by at least one agent. To maximize their return, agents must therefore spread themselves on all landmarks.
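The Spread team-reward can be sketched as a count of occupied landmarks. The occupancy test is not specified in the description above, so this example assumes a landmark is occupied when at least one agent lies within a distance `radius` of it; the function name, the radius and the positions are hypothetical.

```python
import math


def spread_reward(agent_positions, landmark_positions, radius):
    """Count landmarks with at least one agent within `radius` (assumed occupancy test)."""
    occupied = 0
    for landmark in landmark_positions:
        if any(math.dist(agent, landmark) <= radius for agent in agent_positions):
            occupied += 1
    return occupied


# Two agents crowd the first landmark, one sits on the second, the third is empty.
agents = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0)]
landmarks = [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0)]
reward = spread_reward(agents, landmarks, radius=0.1)
```

Because a landmark counts only once however many agents stand on it, crowding one landmark yields no extra reward, which is why the agents need to spread out.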

Bounce (Figure 2b): In this environment, two agents (small purple circles) are linked together with a spring that pulls them toward each other when stretched above its relaxation length. At mid-point during the episode, a ball (smaller red circle) falls from the top of the environment. Agents must position themselves correctly so as to make the ball bounce on the spring towards the target (bigger beige circle). They receive a team-reward of ${r}_{t}=0.1$ if the ball reflects towards the side walls, ${r}_{t}=0.2$ if the ball reflects towards the top of the environment, and ${r}_{t}=10$ if the ball reflects towards the target.

Chase (Figure 2c): In this environment, two predators (red circles) are chasing a prey (green circle). The prey moves according to a hardcoded policy consisting of repulsion forces from the walls and predators. At every timestep, the learning agents (predators) receive a team-reward of ${r}_{t}=n$ where $n$ is the number of predators touching the prey. The prey has a greater top speed and acceleration than the predators. Catching the prey does not trigger the end of the episode. Therefore, to maximize their return, the two agents must coordinate in order to squeeze the prey into a corner or against a wall, effectively immobilizing it.

Imitation (Figure 2d): In this environment, at the beginning of the episode, one of the two sets of landmarks (left or right) is randomly picked to be activated. The bottom agent (small blue circle) receives an individual reward of ${r}_{t}=0.1$ for being in its activated landmark (bottom green bigger circle), and the top agent (small purple circle) is also individually rewarded for standing in its own landmark (top green bigger circle). The trick is that while the bottom agent knows the relative position of its two landmarks as well as which one is activated, the top agent receives neither piece of information and only has access to the relative position of the bottom agent. It must learn to imitate the bottom agent to reach its activated landmark. Importantly, a door closes its corridor once it has committed to one side, leading to a win-or-lose situation.

### 8.2 Algorithms

Training procedure with the ${J}_{TS}$ regularizer:

1. Randomly initialize $N$ critic networks ${Q}^{i}$ and actor networks ${\mu}^{i}$
2. Initialize $N$ target networks ${Q}^{i\prime}$ and ${\mu}^{i\prime}$
3. Initialize one replay buffer $\mathcal{D}$
4. **for** episode from 1 to number of episodes **do**
   1. Initialize random processes ${\mathcal{N}}^{i}$ for action exploration
   2. Receive initial joint observation ${\mathbf{o}}_{1}$
   3. **for** timestep $t$ from 1 to episode length **do**
      1. Select action ${a}_{i}={\mu}^{i}({o}_{t}^{i})+{\mathcal{N}}_{t}^{i}$ for each agent
      2. Execute joint action ${\mathbf{a}}_{t}$ and observe joint reward ${\mathbf{r}}_{t}$ and new observation ${\mathbf{o}}_{t+1}$
      3. Store transition $({\mathbf{s}}_{t},{\mathbf{a}}_{t},{\mathbf{r}}_{t},{\mathbf{s}}_{t+1})$ in $\mathcal{D}$
   4. **end for**
   5. Sample a random minibatch of $M$ transitions from $\mathcal{D}$
   6. **for** each agent $i$ **do**
      1. Evaluate ${\mathcal{L}}^{i}$ and ${J}_{PG}^{i}$ from Equations (1) and (2)
      2. **for** each other agent $j$ ($i\ne j$) **do**
         1. Evaluate ${J}_{TS}^{i,j}$ from Equation (4)
         2. Update actor $j$ with ${\theta}^{j}\leftarrow {\theta}^{j}+{\alpha}_{\theta}{\nabla}_{{\theta}^{j}}{\lambda}_{2}{J}_{TS}^{i,j}$
      3. **end for**
      4. Update critic with ${\varphi}^{i}\leftarrow {\varphi}^{i}+{\alpha}_{\varphi}{\nabla}_{{\varphi}^{i}}{\mathcal{L}}^{i}$
      5. Update actor $i$ with ${\theta}^{i}\leftarrow {\theta}^{i}+{\alpha}_{\theta}{\nabla}_{{\theta}^{i}}\left({J}_{PG}^{i}+{\lambda}_{1}{\sum}_{j=1}^{N}{J}_{TS}^{i,j}\right)$
   7. **end for**
   8. Update all target networks with ${Q}^{i\prime}\leftarrow (1-\tau ){Q}^{i\prime}+\tau {Q}^{i}$ and ${\mu}^{i\prime}\leftarrow (1-\tau ){\mu}^{i\prime}+\tau {\mu}^{i}$
5. **end for**
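The target-network update that closes each episode of the procedure above is a soft (Polyak) average of the online parameters into the target parameters. A minimal sketch follows, assuming each network's parameters are held as a list of NumPy arrays; the function name and this parameter representation are illustrative choices, not the authors' implementation.

```python
import numpy as np


def soft_update(target_params, online_params, tau):
    """Apply target <- (1 - tau) * target + tau * online to every array, in place.

    `target_params` and `online_params` are equally long lists of float NumPy
    arrays with matching shapes (an assumed representation of network weights).
    """
    for target, online in zip(target_params, online_params):
        target *= (1.0 - tau)
        target += tau * online
```

With a small $\tau$ the targets trail the online networks slowly, which is the usual reason for searching $\tau$ on a log scale.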

Training procedure with the coach network ${p}^{c}$:

1. Randomly initialize $N$ critic networks ${Q}^{i}$, actor networks ${\mu}^{i}$ and one coach network ${p}^{c}$
2. Initialize $N$ target networks ${Q}^{i\prime}$ and ${\mu}^{i\prime}$
3. Initialize one replay buffer $\mathcal{D}$
4. **for** episode from 1 to number of episodes **do**
   1. Initialize random processes ${\mathcal{N}}^{i}$ for action exploration
   2. Receive initial joint observation ${\mathbf{o}}_{1}$
   3. **for** timestep $t$ from 1 to episode length **do**
      1. Select action ${a}_{i}={\mu}^{i}({o}_{t}^{i})+{\mathcal{N}}_{t}^{i}$ for each agent
      2. Execute joint action ${\mathbf{a}}_{t}$ and observe joint reward ${\mathbf{r}}_{t}$ and new observation ${\mathbf{o}}_{t+1}$
      3. Store transition $({\mathbf{s}}_{t},{\mathbf{a}}_{t},{\mathbf{r}}_{t},{\mathbf{s}}_{t+1})$ in $\mathcal{D}$
   4. **end for**
   5. Sample a random minibatch of $M$ transitions from $\mathcal{D}$
   6. **for** each agent $i$ **do**
      1. Evaluate ${\mathcal{L}}^{i}$, ${J}_{PG}^{i}$, ${J}_{E}^{i}$ and ${J}_{EPG}^{i}$ from Equations (1), (2), (7) and (6)
      2. Update critic with ${\varphi}^{i}\leftarrow {\varphi}^{i}+{\alpha}_{\varphi}{\nabla}_{{\varphi}^{i}}{\mathcal{L}}^{i}$
      3. Update actor with ${\theta}^{i}\leftarrow {\theta}^{i}+{\alpha}_{\theta}{\nabla}_{{\theta}^{i}}\left({J}_{PG}^{i}+{\lambda}_{2}{J}_{EPG}^{i}+{\lambda}_{1}{J}_{E}^{i}\right)$
   7. **end for**
   8. Update coach with $\psi \leftarrow \psi +{\alpha}_{\psi}{\nabla}_{\psi}\frac{1}{N}{\sum}_{i=1}^{N}\left({J}_{EPG}^{i}+{J}_{E}^{i}\right)$
   9. Update all target networks with ${Q}^{i\prime}\leftarrow (1-\tau ){Q}^{i\prime}+\tau {Q}^{i}$ and ${\mu}^{i\prime}\leftarrow (1-\tau ){\mu}^{i\prime}+\tau {\mu}^{i}$
5. **end for**
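Both procedures share one replay buffer $\mathcal{D}$ that stores joint transitions and is sampled once per episode. The sketch below shows one plausible shape for it. The class name, the fixed capacity with oldest-first eviction, and sampling without replacement are all assumptions, as the text specifies none of them.

```python
import random
from collections import deque


class ReplayBuffer:
    """Single shared buffer of joint transitions (illustrative sketch)."""

    def __init__(self, capacity, seed=None):
        # Oldest transitions are evicted once capacity is reached (assumption).
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, joint_action, joint_reward, next_state):
        """Append one joint transition (s_t, a_t, r_t, s_{t+1})."""
        self._storage.append((state, joint_action, joint_reward, next_state))

    def sample(self, batch_size):
        """Draw a random minibatch of M = batch_size transitions without replacement."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)
```

Storing the joint transition in a single buffer, rather than one buffer per agent, keeps every agent's update aligned on the same sampled timesteps.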

### 8.3 Hyper-parameter search results

### 8.4 Hyper-parameter search ranges

Table 1 shows the ranges from which hyper-parameter values were drawn uniformly during the searches, where ${\eta}_{noise}$ represents the initial exploration noise scale. In our experiments, the learning rates of the agents (${\alpha}_{\theta}$) and the coach network (${\alpha}_{\psi}$) are always equal, motivated by their similar architectures and learning signals, and in order to reduce the search space. For the critic networks, we aim for greater flexibility while still tying their learning rate ${\alpha}_{\varphi}$ to ${\alpha}_{\theta}$ and ${\alpha}_{\psi}$. We thus use a critic learning rate coefficient ${\omega}_{\varphi}$ so that ${\alpha}_{\varphi}={\omega}_{\varphi}\times {\alpha}_{\theta}$.

Hyper-parameter | Range |
---|---|
$\mathrm{log}({\alpha}_{\theta})$ | $[-8,-3]$ |
$\mathrm{log}({\omega}_{\varphi})$ | $[-2,2]$ |
$\mathrm{log}(\tau )$ | $[-3,-1]$ |
$\mathrm{log}({\lambda}_{1})$ | $[-3,0]$ |
$\mathrm{log}({\lambda}_{2})$ | $[-3,0]$ |
${\eta}_{noise}$ | $[0.3,1.8]$ |
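One draw from this search space can be sketched as follows. The base of the logarithm is not stated; base 10 is assumed here, which is consistent with the selected values reported in Section 8.5. The dictionary keys and function name are illustrative.

```python
import random

# Ranges from Table 1, interpreted as base-10 exponents (assumption).
LOG10_RANGES = {
    "alpha_theta": (-8.0, -3.0),
    "omega_phi": (-2.0, 2.0),
    "tau": (-3.0, -1.0),
    "lambda_1": (-3.0, 0.0),
    "lambda_2": (-3.0, 0.0),
}
NOISE_RANGE = (0.3, 1.8)  # eta_noise is drawn on a linear scale


def sample_config(rng):
    """Draw one hyper-parameter configuration uniformly within the search ranges."""
    config = {
        name: 10.0 ** rng.uniform(low, high)
        for name, (low, high) in LOG10_RANGES.items()
    }
    config["eta_noise"] = rng.uniform(*NOISE_RANGE)
    # The coach shares the actors' learning rate.
    config["alpha_psi"] = config["alpha_theta"]
    # The critic learning rate is tied to the actors' through omega_phi.
    config["alpha_phi"] = config["omega_phi"] * config["alpha_theta"]
    return config


if __name__ == "__main__":
    print(sample_config(random.Random(0)))
```

Methods without regularization terms would simply ignore `lambda_1` and `lambda_2`, and methods without a coach would ignore `alpha_psi`.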

### 8.5 Selected hyper-parameters

Tables 2, 3, 4, and 5 show the best hyper-parameters found by the random searches for each environment and each algorithm.

Hyper-parameter | DDPG | MADDPG | SharedMADDPG | TeamMADDPG | CoachMADDPG |
---|---|---|---|---|---|
${\alpha}_{\theta}$ | $5.3\times {10}^{-5}$ | $2.1\times {10}^{-5}$ | $9.0\times {10}^{-4}$ | $2.5\times {10}^{-5}$ | $2.4\times {10}^{-5}$ |
${\omega}_{\varphi}$ | $53$ | $79$ | $0.71$ | $42$ | $9.8$ |
$\tau $ | $0.05$ | $0.083$ | $0.076$ | $0.098$ | $0.019$ |
${\lambda}_{1}$ | - | - | - | $0.054$ | $0.21$ |
${\lambda}_{2}$ | - | - | - | $0.29$ | $0.46$ |
${\eta}_{noise}$ | $1.0$ | $0.5$ | $0.7$ | $1.2$ | $1.2$ |

Hyper-parameter | DDPG | MADDPG | SharedMADDPG | TeamMADDPG | CoachMADDPG |
---|---|---|---|---|---|
${\alpha}_{\theta}$ | $8.1\times {10}^{-4}$ | $3.8\times {10}^{-5}$ | $1.2\times {10}^{-4}$ | $1.3\times {10}^{-5}$ | $2.0\times {10}^{-5}$ |
${\omega}_{\varphi}$ | $2.4$ | $87$ | $0.47$ | $85$ | $55$ |
$\tau $ | $0.089$ | $0.016$ | $0.06$ | $0.055$ | $0.042$ |
${\lambda}_{1}$ | - | - | - | $0.06$ | $0.0027$ |
${\lambda}_{2}$ | - | - | - | $0.0026$ | $0.0024$ |
${\eta}_{noise}$ | $1.2$ | $0.9$ | $1.2$ | $1.0$ | $1.4$ |

Hyper-parameter | DDPG | MADDPG | SharedMADDPG | TeamMADDPG | CoachMADDPG |
---|---|---|---|---|---|
${\alpha}_{\theta}$ | $4.5\times {10}^{-4}$ | $2.0\times {10}^{-4}$ | $9.7\times {10}^{-4}$ | $1.3\times {10}^{-5}$ | $9.5\times {10}^{-4}$ |
${\omega}_{\varphi}$ | $32$ | $64$ | $0.79$ | $85$ | $7.3$ |
$\tau $ | $0.031$ | $0.021$ | $0.032$ | $0.055$ | $0.068$ |
${\lambda}_{1}$ | - | - | - | $0.06$ | $0.058$ |
${\lambda}_{2}$ | - | - | - | $0.0026$ | $0.017$ |
${\eta}_{noise}$ | $0.6$ | $1.0$ | $1.5$ | $1.0$ | $1.2$ |

Hyper-parameter | DDPG | MADDPG | SharedMADDPG | TeamMADDPG | CoachMADDPG |
---|---|---|---|---|---|
${\alpha}_{\theta}$ | $4.4\times {10}^{-6}$ | $3.3\times {10}^{-4}$ | $9.2\times {10}^{-4}$ | $3.7\times {10}^{-4}$ | $2.0\times {10}^{-6}$ |
${\omega}_{\varphi}$ | $42$ | $0.97$ | $66$ | $0.79$ | $64$ |
$\tau $ | $0.04$ | $0.0092$ | $0.015$ | $0.041$ | $0.049$ |
${\lambda}_{1}$ | - | - | - | $0.032$ | $0.0044$ |
${\lambda}_{2}$ | - | - | - | $0.0013$ | $0.0014$ |
${\eta}_{noise}$ | $1.6$ | $1.1$ | $1.0$ | $0.4$ | $1.8$ |