Promoting Coordination through Policy Regularization in Multi-Agent Reinforcement Learning

  • 2019-08-06 17:48:17
  • Paul Barde, Julien Roy, Félix G. Harvey, Derek Nowrouzezahrai, Christopher Pal


A central challenge in multi-agent reinforcement learning is the induction of coordination between agents of a team. In this work, we investigate how to promote inter-agent coordination and discuss two possible avenues based respectively on inter-agent modelling and guided synchronized sub-policies. We test each approach in four challenging continuous control tasks with sparse rewards and compare them against three variants of MADDPG, a state-of-the-art multi-agent reinforcement learning algorithm. To ensure a fair comparison, we rely on a thorough hyper-parameter selection and training methodology that allows a fixed hyper-parameter search budget for each algorithm and environment. We consequently assess both the hyper-parameter sensitivity, sample-efficiency and asymptotic performance of each learning method. Our experiments show that our proposed algorithms are more robust to the hyper-parameter choice and reliably lead to strong results.


Paul Barde*
Quebec AI institute (Mila)
McGill University

Julien Roy*
Quebec AI institute (Mila)
Polytechnique Montréal

Félix G. Harvey
Quebec AI institute (Mila)
Polytechnique Montréal

Derek Nowrouzezahrai
Quebec AI institute (Mila)
McGill University

Christopher Pal
Quebec AI institute (Mila)
Polytechnique Montréal

* Equal contribution

Preprint. Under review.

1 Introduction

Multi-agent Reinforcement Learning (MARL) refers to the task of training an agent to maximize its expected return by interacting with an environment that contains other learning agents. It represents a challenging branch of reinforcement learning with interesting developments in recent years [jaderberg2018human, lowe2017multi, foerster2018counterfactual, hernandez2018multiagent].

A popular framework for MARL is the use of a centralized training procedure and decentralized execution [lowe2017multi, foerster2018counterfactual, iqbal2018actor, foerster2018bayesian, rashid2018qmix], typically done by training critics that approximate the value of the joint observations and actions, themselves used to train actors that are restricted to the observation-action pair of a single agent. Such critics, if exposed to coordinated joint actions leading to high returns, can steer the agents’ policies toward these highly rewarding behaviors. However, these approaches depend on the agents luckily stumbling on coordinated joint actions in order to grasp their benefit. Thus, they might fail in scenarios where coordination is unlikely to occur by chance. We hypothesize that in such scenarios, coordination-promoting inductive biases on the policy search could help discover coordinated behaviors more efficiently and supersede task-related reward shaping and curriculum learning.

In this work, we explore two different priors to successful coordination and use these to regularize the learned policies. The first avenue (TeamMADDPG) supposes that an agent must be able to predict the behavior of its teammates in order to coordinate with them. The second (CoachMADDPG) assumes that coordinating agents individually recognize different modes or situations, and synchronously use different sub-policies (or behaviors) to deal with these distinct events. In the following sections we show how to derive practical regularization terms from these premises and meticulously evaluate them. (Footnote 1: Source code for the algorithms and environments will be made public upon publication of this work.)

Our contributions are twofold. First, we propose two novel algorithms, TeamMADDPG and CoachMADDPG, that aim at promoting coordination in multi-agent systems. Our approaches build on the widely used MADDPG algorithm and augment it with additional multi-agent objectives that act as regularizers and are optimized jointly with the main return-maximisation objective. Second, we design four sparse-reward collaboration tasks in the multi-agent particle environment [mordatch2018emergence] and present a detailed evaluation of these algorithms against three baseline variations. We further analyze the effect of the induced behavioral bias by performing an ablation study, which suggests that our team-spirit objective provides a dense learning signal that helps guide the policy towards coordination in the absence of external reward and effectively leads it to the discovery of high-performing team strategies in a number of cooperative tasks.

2 Background

2.1 Markov Games

In this work we consider the framework of Markov Games [littman1994markov], a multi-agent extension of Markov Decision Processes (MDPs). A Markov Game involves $N$ independent agents and is defined by the tuple $\langle \mathcal{S}, \mathcal{T}, \mathcal{P}, \{\mathcal{O}^i, \mathcal{A}^i, \mathcal{R}^i\}_{i=1}^{N} \rangle$. $\mathcal{S}$, $\mathcal{T}$ and $\mathcal{P}$ are respectively the set of all possible states, the transition function and the initial state distribution. While these are global properties of the environment, $\mathcal{O}^i$, $\mathcal{A}^i$ and $\mathcal{R}^i$ are individually defined for each agent $i$. They are respectively the observation functions, the sets of all possible actions and the reward functions. At each time-step $t$, the global state of the environment is given by $s_t \in \mathcal{S}$ and every agent’s individual action vector is denoted by $a_t^i \in \mathcal{A}^i$. To select its action, each agent $i$ only has access to its own observation vector $o_t^i$, which is extracted by its observation function $\mathcal{O}^i$ from the global state $s_t$. The initial global state $s_1$ is sampled from the initial state distribution $\mathcal{P}: \mathcal{S} \to [0,1]$ and the next states of the environment $s_{t+1}$ are sampled from the probability distribution over the possible next states $P(\mathcal{S})$ given by the transition function $\mathcal{T}: \mathcal{S} \times \mathcal{A}^1 \times \dots \times \mathcal{A}^N \to P(\mathcal{S})$. Finally, at each time-step each agent receives an individual scalar reward $r_t^i$ from its reward function $\mathcal{R}^i: \mathcal{S} \times \mathcal{A}^1 \times \dots \times \mathcal{A}^N \to \mathbb{R}$. Agents aim at maximizing their expected discounted return $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t^i\right]$ over the time horizon $T$, where $\gamma \in [0,1]$ is a discount factor.
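As a small illustration of the objective above, the discounted return of one agent over a finite episode can be computed as follows. This is a minimal sketch: the function name `discounted_return` and the example reward sequence are illustrative assumptions, not part of the original work.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one agent's reward sequence."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

The expectation in the objective would then be estimated by averaging this quantity over sampled episodes.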

2.2 Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

MADDPG [lowe2017multi] is an adaptation of the Deep Deterministic Policy Gradient algorithm (DDPG) [lillicrap2015continuous] to the multi-agent setting. It allows the training of cooperating and competing decentralized policies through the use of a centralized training procedure. In this framework, each agent $i$ possesses its own deterministic policy $\mu^i$ for action selection and critic $Q^i$ for state-action value estimation, which are respectively parametrized by $\theta^i$ and $\phi^i$. All parametric models are trained off-policy from previous transitions $(\mathbf{o}_t, \mathbf{a}_t, \mathbf{r}_t, \mathbf{o}_{t+1})$ uniformly sampled from a replay buffer $\mathcal{D}$. Note that $\mathbf{o}_t = [o_t^1, \dots, o_t^N]$ is the joint observation vector and $\mathbf{a}_t = [a_t^1, \dots, a_t^N]$ is the joint action vector, obtained by concatenating the individual observation vectors $o_t^i$ and action vectors $a_t^i$ of all $N$ agents. Each centralized critic is trained to estimate the expected return for a particular agent $i$ using the Deep Q-Network (DQN) [mnih2015human] loss:

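$$\mathcal{L}^i(\phi^i) = \mathbb{E}_{\mathbf{o}_t, \mathbf{a}_t, \mathbf{o}_{t+1}, r_t^i \sim \mathcal{D}}\left[ \frac{1}{2}\left( Q^i(\mathbf{o}_t, \mathbf{a}_t; \phi^i) - \left( r_t^i + \gamma\, Q^i_{\text{target}}(\mathbf{o}_{t+1}, \mathbf{a}_{t+1}) \right) \right)^2 \Big|_{a_{t+1}^j = \mu^j(o_{t+1}^j)} \right] \tag{1}$$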

Each policy is updated to maximize the expected discounted return of the corresponding agent $i$:

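$$J_{PG}^i(\theta^i) = \mathbb{E}_{\mathbf{o}_t, \mathbf{a}_t \sim \mathcal{D}}\left[ Q^i(\mathbf{o}_t, \mathbf{a}_t) \Big|_{a_t^i = \mu^i(o_t^i; \theta^i)} \right] \tag{2}$$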

yielding the following policy gradient:

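$$\nabla_{\theta^i} J_{PG}^i(\theta^i) = \mathbb{E}_{\mathbf{o}_t, \mathbf{a}_t \sim \mathcal{D}}\left[ \nabla_{\theta^i} \mu^i(o_t^i; \theta^i)\, \nabla_{a_t^i} Q^i(\mathbf{o}_t, \mathbf{a}_t) \Big|_{a_t^i = \mu^i(o_t^i; \theta^i)} \right] \tag{3}$$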

By guiding each agent’s policy toward states that have been more positively evaluated when taking into account all agents’ observation-action pairs, the centralized training of the value-function restores stationarity in the multi-agent setting. In addition, this mechanism can allow agents to implicitly learn coordinated strategies that can then be deployed in a decentralized way. However, this procedure does not encourage the discovery of coordinated strategies since high-reward behaviors have to be randomly experienced through unguided exploration.
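The critic loss of Equation (1) can be illustrated on a minibatch of scalar values. This is a minimal sketch under stated assumptions: the critic outputs are passed in as plain numbers rather than produced by a network, the minibatch mean stands in for the expectation, and the name `critic_loss` is hypothetical.

```python
def critic_loss(q_values, rewards, next_target_q, gamma=0.95):
    """Mean of 0.5 * (Q - (r + gamma * Q_target_next))**2 over a minibatch."""
    assert len(q_values) == len(rewards) == len(next_target_q)
    total = 0.0
    for q, r, q_next in zip(q_values, rewards, next_target_q):
        td_target = r + gamma * q_next  # bootstrapped target, treated as a constant
        total += 0.5 * (q - td_target) ** 2
    return total / len(q_values)


# One transition: Q = 0, r = 1, target critic predicts 2 at the next step.
print(critic_loss([0.0], [1.0], [2.0], gamma=0.5))  # 2.0
```

In a full implementation, `next_target_q` would come from the target critic evaluated on the next joint observation and the next joint action given by the policies.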

3 Related Work

Many works in multi-agent reinforcement learning consider explicit communication channels between the agents and distinguish between communicative actions (broadcasting a given message) and physical actions (moving in a given direction) [foerster2016learning, mordatch2018emergence, lazaridou2016multi]. Consequently, they often focus on the emergence of language, considering tasks where the agents must discover a common communication protocol in order to succeed. Deriving a successful communication protocol can already be seen as coordination in the communicative action space and can enable, to some extent, successful coordination in the physical action space [ahilan2019feudal]. However, explicit communication is not a necessary condition for coordination as agents can rely on physical communication [mordatch2018emergence, gupta2017cooperative].

Approaches to shape RL agents’ behaviors with respect to other agents have also been explored. Strouse et al. [strouse2018learning] use the mutual information between the agent’s policy and a goal-independent policy to shape the agent’s behaviour towards hiding or spelling out its current goal. However, this approach is only applicable for tasks with an explicit goal representation and is not specifically intended for coordination. Jaques et al. [jaques2018intrinsic] use the direct causal effect between agents’ actions as an intrinsic reward to encourage social empowerment, but rely on giving to some agents (the influencees) the action of other agents (the influencers) as input to their policy, which violates the decentralized execution framework.

Finally, Barton et al. [barton2018measuring] propose CCM (convergent cross mapping) as a metric for measuring the degree of effective coordination between two agents. However, while this may represent an interesting avenue for behavior analysis, it does not provide any tool for actually enforcing coordination, as CCM must be computed over long time series, which makes it impractical as a learning signal for single-step temporal-difference-based methods. In this work, we design two coordination-driven multi-agent algorithms that do not rely on the existence of explicit communication channels and allow the learned coordinated behaviors to carry over to test time, when all agents act in a decentralized fashion.

4 Policy regularization

Our two proposed algorithms use team objectives as regularizers alongside the common policy gradient update. Pseudocode for our implementations is provided in Appendix 8.2 (see the TeamMADDPG and CoachMADDPG algorithms).

4.1 Team regularization

We hypothesize that coordination can be defined as the degree of predictability of one agent’s behavior with respect to its teammate(s). In other words, in a coordinated team, there should be some predictable structure between the agents’ actions. We cast this assumption into the decentralized framework by training agents to predict their teammates’ actions given only their own observation, which can be simply implemented by adding additional heads to each agent’s policy network. We define this regularization term as the Mean Squared Error (MSE) between the predicted and real actions of the teammates, yielding a teammate-modelling secondary objective similar to the models of other agents used in [jaques2018intrinsic, hern2019agent] and often referred to as agent modelling [schadd2007opponent]. Importantly, most previous work such as [hern2019agent] exclusively uses this approach to improve the richness of the learned representations. In our case, the same objective is also used to drive the teammates’ behaviors closer to the prediction, a process that requires a differentiable action selection mechanism. We call this the Team-Spirit regularization term $J_{TS}^{i,j}$ between agents $i$ and $j$:

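$$J_{TS}^{i,j}(\theta^i, \theta^j) = \mathrm{MSE}(\hat{a}_t^{i,j}, a_t^j) = \frac{1}{2} \sum_{k=1}^{|\mathcal{A}^j|} \left( \hat{a}_{t,k}^{i,j} - a_{t,k}^j \right)^2 \tag{4}$$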

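The total gradient for a given agent $i$ becomes:

$$\nabla_{\theta^i} J_{total}^i(\theta^i) = \nabla_{\theta^i} J_{PG}^i(\theta^i) + \lambda_1 \sum_j \nabla_{\theta^i} J_{TS}^{i,j}(\theta^i, \theta^j) + \lambda_2 \sum_j \nabla_{\theta^i} J_{TS}^{j,i}(\theta^j, \theta^i) \tag{5}$$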

where $\lambda_1$ and $\lambda_2$ are hyper-parameters that weigh respectively how well an agent should predict its teammates’ actions, and how predictable an agent should be for its teammates. We call this modified algorithm TeamMADDPG. The diagram of Figure 1a summarizes these interactions.
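The two roles of the Team-Spirit term can be made concrete with a small numerical sketch of Equation (4) and its gradients with respect to the predicted and the actual action. The sketch assumes the term is treated as a loss to be reduced (a sign convention chosen here for illustration), works directly on action vectors instead of network parameters, and uses the hypothetical name `team_spirit`.

```python
def team_spirit(pred_action, true_action):
    """Return 0.5 * sum_k (pred_k - true_k)**2 and its gradients w.r.t. both inputs."""
    diff = [p - a for p, a in zip(pred_action, true_action)]
    value = 0.5 * sum(d * d for d in diff)
    grad_pred = list(diff)          # used by agent i to improve its prediction
    grad_true = [-d for d in diff]  # used by agent j to become more predictable
    return value, grad_pred, grad_true


pred, true = [1.0, 0.0], [0.0, 0.0]
value, g_pred, g_true = team_spirit(pred, true)
step = 0.5  # illustrative step size applied to both sides
pred = [p - step * g for p, g in zip(pred, g_pred)]
true = [a - step * g for a, g in zip(true, g_true)]
print(value, pred, true)  # 0.5 [0.5, 0.0] [0.5, 0.0]
```

With equal step sizes the prediction and the action meet halfway, which is the intuition behind weighting the two directions separately with $\lambda_1$ and $\lambda_2$.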

(a) TeamMADDPG
(b) CoachMADDPG
Figure 1: Illustrations of the proposed regularization schemes in a two-agents MARL problem. (a) Each agent’s policy is equipped with additional heads that are trained to predict other agents’ actions and every agent is regularized to produce actions that its teammates correctly predict. Note that this regularization is depicted for agent-1 only to avoid cluttering. (b) A central model called Coach takes all agents’ observations as input and outputs the current mode (one-hot vector). Agents are regularized to predict the same mode from their local observations only, and use it as a mask to select their corresponding subpolicy.

4.2 Coach regularization

This regularization choice builds on the assumption that coordination could be more easily achieved if agents had the same representation of the current situation and could use this representation to synchronously select a sub-behavior. To promote this, we introduce the coach, a new entity parametrized by $\psi$, which learns to produce from the joint observations of all agents a discrete embedding (behavior-mask) $e_t^c$ that is passed to the agents at training time. This mask is a one-hot vector of size $k$ drawn from a multinomial distribution $p^c(e_t^c | \mathbf{o}_t; \psi)$. In practice, the policy network $\mu^i$ of each agent $i$ is modified as follows: a first linear pre-hidden layer is split into two heads. The first one consists of the pre-activations of the hidden layer $h_1$. The second head, of dimension $k$, consists of the pre-activations of a softmax function yielding the distribution $p^i(e_t^i | o_t^i; \theta^i)$ from which a behavior mask is sampled. The mask coming either from the coach or from the agent is then tiled along the dimensions of the first head of the linear layer and applied via element-wise multiplication, a process similar to dropout [srivastava2014dropout] but in a deterministic, structured fashion. The output forms $h_1$, consisting therefore of the masked pre-activations of the first linear head.
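The masking step described above can be sketched on plain lists. This is one possible reading, not the authors' implementation: it assumes that "tiling" repeats the size-$k$ mask pattern along the hidden dimension and that the hidden width is a multiple of $k$; the name `apply_behavior_mask` is hypothetical.

```python
def apply_behavior_mask(pre_activations, mask):
    """Tile a one-hot mask of size k over the pre-activations and multiply element-wise."""
    k = len(mask)
    assert len(pre_activations) % k == 0, "hidden width assumed divisible by k"
    # Tiling repeats the mask pattern, so unit n is gated by mask[n % k].
    return [h * mask[n % k] for n, h in enumerate(pre_activations)]


# Six hidden units, k = 3 modes, mode 1 active: only units 1 and 4 pass through.
print(apply_behavior_mask([1, 2, 3, 4, 5, 6], [0, 1, 0]))  # [0, 2, 0, 0, 5, 0]
```

Under this reading, each mode activates a fixed, disjoint subset of hidden units, which is what makes the operation resemble a deterministic, structured form of dropout.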

In other words, the policy of agent $i$ is conditioned via embedding-masking and $a_t^i = \mu(o_t^i, e_t; \theta^i)$, where $e_t$, the behavior-mask at time $t$, can either be the agent’s private mask $e_t^i$ or the coach’s mask $e_t^c$. To allow this, the coach is trained to output masks that (1) yield high returns when passed to the agents and (2) are predictable by the agents. Similarly, each agent is regularized so that (1) its private mask matches the coach’s mask and (2) it derives efficient behavior from it. This scheme enforces coordinated sub-policies in agents, while having agents predict the best behavior that the coach would impose given all observations. The policy gradient loss when agent $i$ is provided with the coach’s plan is given by:

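$$J_{EPG}^i(\theta^i, \psi) = \mathbb{E}_{\mathbf{o}_t, \mathbf{a}_t \sim \mathcal{D}}\left[ Q^i(\mathbf{o}_t, \mathbf{a}_t) \Big|_{a_t^i = \mu(o_t^i, e_t; \theta^i),\; e_t \sim p^c(\cdot | \mathbf{o}_t; \psi)} \right] \tag{6}$$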

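The difference between the plan of agent $i$ and the coach’s plan is measured by the Kullback–Leibler divergence:

$$J_E^i(\theta^i, \psi) = \mathbb{E}_{\mathbf{o}_t \sim \mathcal{D}}\left[ D_{KL}\left( p^c(\cdot | \mathbf{o}_t; \psi) \,\|\, p^i(\cdot | o_t^i; \theta^i) \right) \right] \tag{7}$$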
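For a single observation, the divergence inside Equation (7) compares two categorical distributions over the $k$ modes. The following sketch computes it for explicit probability vectors; the function name `kl_divergence` and the small `eps` guard against division by zero are illustrative assumptions.

```python
import math


def kl_divergence(p_coach, p_agent, eps=1e-12):
    """D_KL(p_coach || p_agent) for two categorical distributions over k modes."""
    total = 0.0
    for pc, pa in zip(p_coach, p_agent):
        if pc > 0.0:  # terms with pc == 0 contribute nothing
            total += pc * math.log(pc / max(pa, eps))
    return total


# The coach is certain of mode 0 while the agent is undecided between two modes.
print(kl_divergence([1.0, 0.0], [0.5, 0.5]))  # log(2), about 0.693
```

The divergence is zero only when the agent's private mask distribution matches the coach's, which is the condition the regularizer pushes towards.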

The total gradient for agent i is:

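$$\nabla_{\theta^i} J_{total}^i(\theta^i, \psi) = \nabla_{\theta^i} J_{PG}^i(\theta^i) + \lambda_1 \nabla_{\theta^i} J_E^i(\theta^i, \psi) + \lambda_2 \nabla_{\theta^i} J_{EPG}^i(\theta^i, \psi) \tag{8}$$

$$\nabla_{\theta^i} J_{EPG}^i(\theta^i, \psi) = \mathbb{E}_{\mathbf{o}_t, \mathbf{a}_t \sim \mathcal{D}}\left[ \nabla_{\theta^i} \mu(o_t^i, e_t; \theta^i)\, \nabla_{a_t^i} Q^i(\mathbf{o}_t, \mathbf{a}_t) \Big|_{a_t^i = \mu(o_t^i, e_t),\; e_t \sim p^c(\cdot | \mathbf{o}_t; \psi)} \right] \tag{9}$$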

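with $\lambda_1$ and $\lambda_2$ the regularization coefficients. Similarly for the coach:

$$\nabla_{\psi} J_{total}^c(\psi) = \frac{1}{N} \sum_{i=1}^{N} \left( \nabla_{\psi} J_{EPG}^i(\theta^i, \psi) + \nabla_{\psi} J_E^i(\theta^i, \psi) \right) \tag{10}$$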

In order to propagate gradients through the sampled plan we reparametrized the multinomial distribution using the Gumbel-softmax trick [jang2016categorical]. We call this modified algorithm CoachMADDPG. The diagram of Figure 1b summarizes these interactions.
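The sampling side of the Gumbel-softmax trick can be sketched in a few lines. This is a generic illustration rather than the authors' code: the temperature value, the soft (non-discretized) output and the name `gumbel_softmax` are assumptions, and a practical implementation would rely on an automatic differentiation library to obtain gradients.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw a relaxed one-hot sample from the categorical defined by `logits`."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        gumbel = -math.log(-math.log(u))                # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    z = sum(exps)
    return [e / z for e in exps]


sample = gumbel_softmax([0.0, 1.0, 2.0], temperature=0.5, rng=random.Random(0))
print(sample, sum(sample))
```

Lower temperatures push the sample closer to a one-hot vector, at the cost of higher-variance gradients.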

5 Experimental setup

5.1 Training environments

All of our environments are based on a modified version of the OpenAI multi-agent particle environments [mordatch2018emergence]. While non-zero return can be achieved in these tasks by selfish agents, they all benefit from coordinated strategies and optimal return can only be achieved by agents working closely together. Figure 2 presents visualizations and a brief description of all four environments. A more detailed description is provided in Appendix 8.1. In all environments except Imitation (see Figure 2-d), agents receive as observations their own global position and velocity as well as the relative position of other agents and landmarks. Importantly, while most work showcasing experiments with this engine uses discrete action spaces at least part of the time and usually learns on dense rewards (e.g. the inverse of the distance to the objective) [iqbal2018actor, lowe2017multi, jiang2018learning], in all of our experiments, agents learn on continuous action spaces and from sparse rewards only.

Figure 2: Multi-agent environments used in this work. (a) Spread: Agents must spread themselves on the different landmarks. (b) Bounce: Two agents are linked together by a spring and must position themselves so that the red ball bounces towards a target. (c) Chase: Two agents (red) chase a scripted prey (green) that moves w.r.t repulsion forces. (d) Imitation: The top agent (purple) needs to learn on which side the activated landmark is only by looking at the bottom agent’s behavior.

5.2 Hyper-parameter tuning and training details

To offer a fair comparison between all methods, the hyper-parameter search routine is the same for each algorithm and environment. For each search-experiment (one per algorithm per environment), 50 randomly sampled hyper-parameter configurations are used to train the models for 15,000 steps on 3 different training seeds. During training, the policies are evaluated on a fixed set of 10 different episodes every 100 learning steps. To assess the performance of a learned policy, the model from its best iteration is evaluated on 100 different episodes. The performance of a hyper-parameter configuration is defined as the performance of the corresponding learned policies averaged across training seeds.
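The selection rule described above (average over training seeds, then keep the best configuration) can be sketched as follows. The configuration identifiers and scores are made-up placeholders, and `best_configuration` is a hypothetical helper name.

```python
def best_configuration(scores):
    """`scores` maps a configuration id to its list of per-seed performances."""
    averaged = {cfg: sum(vals) / len(vals) for cfg, vals in scores.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


# Placeholder scores for two configurations trained on three seeds each.
example = {"cfg_a": [1.0, 2.0, 3.0], "cfg_b": [4.0, 0.0, 5.0]}
print(best_configuration(example))  # ('cfg_b', 3.0)
```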

We perform searches over the following hyper-parameters: the learning rate of the actor $\alpha_\theta$, the learning rate of the critic $\omega_\phi$ relative to the actor ($\alpha_\phi = \omega_\phi \cdot \alpha_\theta$), the target-network soft-update parameter $\tau$ and the initial scale of the exploration noise $\eta_{noise}$ for the Ornstein-Uhlenbeck noise generating process [uhlenbeck1930theory] (such as in [lillicrap2015continuous]). In the case of TeamMADDPG and CoachMADDPG, we additionally search over the regularization weights $\lambda_1$ and $\lambda_2$. The learning rate of the coach is always equal to the actor’s learning rate (i.e. $\alpha_\theta = \alpha_\psi$), motivated by their similar architectures and learning signals and in order to reduce the search space. Table 1 in Appendix 8.4 shows the tunable hyper-parameters and their search ranges.

In all of our experiments, we use the Adam optimizer [kingma2014adam] to perform parameter updates. All models (actors and critics) are parametrized by MLPs containing two hidden layers of 128 units each. We use the Rectified Linear Unit (ReLU) [nair2010rectified] as activation function and layer normalization [ba2016layer] to stabilize the learning. We use a buffer-size of $10^6$ entries and a batch-size of 1024. We collect 100 transitions by interacting with the environment for each learning update. For all tasks in our hyper-parameter searches, we train the agents for 15,000 episodes of 100 steps and then re-train the best configuration for each (algorithm, environment) pair for twice as long (30,000 episodes) to ensure full convergence for the final evaluation. The scale of the exploration noise is kept constant for the first half of the training time and then decreases linearly to 0 until the end of training. We use a reward discount factor $\gamma$ of 0.95 and a gradient clipping threshold of 0.5 in all experiments.
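The exploration-noise schedule described above (constant for the first half of training, then linear decay to 0) could be written as the following sketch. The function name `noise_scale` and the use of a step counter as the unit of training time are assumptions made for illustration.

```python
def noise_scale(step, total_steps, initial_scale):
    """Noise scale at `step`: constant for the first half, then linear decay to 0."""
    half = total_steps / 2
    if step <= half:
        return initial_scale
    remaining = (total_steps - step) / (total_steps - half)
    return initial_scale * max(remaining, 0.0)


for step in (0, 50, 75, 100):
    print(step, noise_scale(step, 100, 1.2))
```

The returned value would multiply the output of the Ornstein-Uhlenbeck process before it is added to the deterministic action.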

6 Results and Discussion

We compare our algorithms (TeamMADDPG and CoachMADDPG) against MADDPG, DDPG and SharedMADDPG on the four cooperative multi-agent tasks described in Section 5.1. DDPG is the single-agent counterpart of MADDPG (decentralized training) whereas SharedMADDPG shares the policy and value-function models across agents. We report in Figure 3 the results of our per-algorithm hyper-parameter tuning as described in Section 5.2.

6.1 Robustness to hyper-parameters

Figure 3: Summarized performance distributions of the sampled hyper-parameter configurations for each (algorithm, environment) pair. The box-plots divide in quartiles the 49 lower-performing configurations for each distribution while the score of the best-performing configuration is highlighted above the box-plots by a single dot.

Stability across hyper-parameter configurations is a recurring challenge in deep reinforcement learning. The average performance (across seeds) for each randomly sampled hyper-parameter configuration allows us to empirically evaluate the robustness of an algorithm with respect to its hyper-parameters. Figure 3 shows that while multiple algorithms exhibit good maximum performance after 15,000 training steps on several environments, our proposed coordination-regularizing strategies can boost robustness to hyper-parameter variations in many environments, which can be of great value with more limited search budgets.

6.2 Final performances

We retrain all algorithms with their most successful configuration for each environment, this time on 10 different seeds, and report the average learning curves in Figure 4. TeamMADDPG and CoachMADDPG outperform all other algorithms by important margins in Spread and Bounce, both in terms of sample efficiency and asymptotic performance. All algorithms perform competitively in Chase, with SharedMADDPG slightly above. However, SharedMADDPG fails to learn the task in the three other environments (Spread, Bounce and Imitation). Finally, CoachMADDPG and DDPG obtain the best results in Imitation, with TeamMADDPG and MADDPG slightly lower.

Lowe et al. [lowe2017multi] suggest that DDPG’s lower performance can be caused by the absence of centralized critics, which makes the environment appear non-stationary from any individual agent’s perspective. This could explain why it performs best on the most asymmetric task, Imitation, where only one agent is influenced by the other. The trailing results of SharedMADDPG might be explained by the fact that sharing parameters across agents prevents specialization, a useful asset even in homogeneous cooperative tasks. Finally, TeamMADDPG and CoachMADDPG perform competitively with or better than MADDPG on all environments, suggesting that learning team-related secondary tasks is a valuable way to foster efficient coordinated strategies, resulting in faster learning and better performance.

Figure 4: Average learning curves for all algorithms on all four environments. Solid lines are the mean across 10 training seeds of the evaluation performance on 10 fixed episodes. The envelopes are the standard error across the 10 training seeds.
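For reference, the mean and standard error plotted at one point of such a learning curve can be computed as in the sketch below. It assumes the usual definition of the standard error (sample standard deviation divided by the square root of the number of seeds); the helper name and the example values are placeholders, not results from this work.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Mean and standard error of per-seed evaluation scores at one training step."""
    n = len(values)
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 in the denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


print(mean_and_standard_error([1.0, 2.0, 3.0]))  # mean 2.0, error 1/sqrt(3)
```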

6.3 The effects of being predictable

We aim to analyze here the effects of the regularizers of TeamMADDPG. Specifically, we ask if the regularizer weight λ2 that pushes agents to be predictable by others is valuable. To this end, we compare one run of the best performing hyper-parameter configuration for TeamMADDPG on the Spread environment with two ablated versions. For the first ablation, we keep everything else fixed and retrain this model while setting λ2 to zero. For the second ablation, we set both regularizers’ weights λ1 and λ2 to zero. The former variant corresponds to MADDPG only augmented by the side task of predicting others’ actions whereas the latter is exactly equivalent to the MADDPG algorithm, where agents neither try to predict others’ actions, nor to be predictable. The expected return and Team-Spirit loss defined in Section 4.1 averaged over the three agents of the environment are presented in Figure 5 for these three experiments.

Figure 5: Comparison between enabling and disabling the regularizing weights λ1 and λ2 in TeamMADDPG on a successful configuration for the Spread environment

At the beginning of training, due to the weight initialization, the outputted and predicted actions from agents both have relatively small norms, explaining the small Team-Spirit loss. As training goes on, the norms of the action vectors quickly increase and the regularization loss becomes more important. As is to be expected, the non-regularized baseline ($\lambda_1$ OFF | $\lambda_2$ OFF) has a high Team-Spirit loss as it is in no way encouraged to predict the actions of other agents correctly. When reintegrating the agent-modelling objective ($\lambda_1$ ON | $\lambda_2$ OFF), the agents significantly reduce the Team-Spirit loss, but it never reaches values as low as for the full TeamMADDPG since we removed the incentive for agents to be predictable. Interestingly, this setting also performs slightly worse than the baseline overall on the task. We hypothesize that this is because the regularizing task becomes too hard in this case, and that the learning signal from this auxiliary loss becomes too important in the learning dynamics, hindering performance. Finally, when also pushing agents to be predictable ($\lambda_1$ ON | $\lambda_2$ ON), the agents best predict others’ actions and performance is also improved. We also notice that the Team-Spirit loss increases when performance starts to increase, i.e. when agents start to master the task. Once the reward maximisation signal becomes stronger, the relative importance of the second task is reduced. We hypothesize that being predictable with respect to one another may push agents to explore in a more structured and informed manner in the absence of reward signal, as similarly pursued by intrinsic motivation approaches [chentanez2005intrinsically].

7 Conclusion

In this work we introduced two policy regularization methods to promote multi-agent coordination. One is based on inter-agent action predictability and is called TeamMADDPG. The other one, CoachMADDPG, is based on synchronized and corresponding behavior selection with the help of an auxiliary coach network present only during training. We performed fair and transparent hyper-parameter searches for each compared algorithm and environment and found that our regularizing strategies have a positive effect on robustness to hyper-parameter selection and asymptotic performance, empirically showing the benefits of such coordination-promoting regularization in the cooperative multi-agent setting. Our techniques work with sparse rewards and continuous action spaces, and do not break the decentralized execution paradigm. One downside of our methods is that they are restricted in their current form to metrics that only account for the current timestep, a restriction that simplifies off-policy learning but also greatly impairs the longer-term coordination possibilities. In future work we aim to explore model-based planning approaches for multi-agent coordination.


We wish to thank Olivier Delalleau for providing insightful comments on previous versions of these methods as well as Fonds de Recherche Nature et Technologies (FRQNT), Ubisoft Montréal and Mitacs for providing part of the funding for this work.


8 Appendix

8.1 Tasks descriptions

Spread (Figure 2a): In this environment, there are 3 agents (small orange circles) and 3 landmarks (bigger gray circles). At every timestep, agents receive a team-reward $r_t = n$ where $n$ is the number of landmarks occupied by at least one agent. To maximize their return, agents must therefore spread themselves on all landmarks.

Bounce (Figure 2b): In this environment, two agents (small purple circles) are linked together with a spring that pulls them toward each other when stretched above its relaxation length. At mid-point during the episode, a ball (smaller red circle) falls from the top of the environment. Agents must position themselves correctly so as to make the ball bounce on the spring towards the target (bigger beige circle). They receive a team-reward of $r_t = 0.1$ if the ball reflects towards the side walls, $r_t = 0.2$ if the ball reflects towards the top of the environment, and $r_t = 10$ if the ball reflects towards the target.

Chase (Figure 2c): In this environment, two predators (red circles) are chasing a prey (green circle). The prey moves according to a hardcoded policy consisting of repulsion forces from the walls and predators. At every timestep, the learning agents (predators) receive a team-reward of $r_t = n$ where $n$ is the number of predators touching the prey. The prey has a greater top speed and acceleration than the predators. Catching the prey does not trigger the end of the episode. Therefore, to maximize their return, the two agents must coordinate in order to squeeze the prey into a corner or against a wall and effectively immobilize it.

Imitation (Figure 2d): In this environment, at the beginning of the episode, one of the two sets of landmarks (left or right) is randomly picked to be activated. The bottom agent (small blue circle) receives an individual reward of $r_t = 0.1$ for being in its activated landmark (bottom green bigger circle), and the top agent (small purple circle) is also individually rewarded for standing in its own landmark (top green bigger circle). The trick is that while the bottom agent knows the relative position of its two landmarks as well as which one is activated, the top agent receives neither of these pieces of information and only has access to the relative position of the bottom agent. It must learn to imitate the bottom agent to reach its activated landmark. Importantly, a door closes its corridor once it has committed to one side, leading to a win or lose situation.

8.2 Algorithms

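Algorithm: TeamMADDPG

Randomly initialize $N$ critic networks $Q^i$ and actor networks $\mu^i$
Initialize $N$ target networks $Q^i_{\text{target}}$ and $\mu^i_{\text{target}}$
Initialize one replay buffer $\mathcal{D}$
for episode from 1 to number of episodes do
    Initialize random processes $\mathcal{N}^i$ for action exploration
    Receive initial joint observation $\mathbf{o}_1$
    for timestep $t$ from 1 to episode length do
        Select action $a^i = \mu^i(o_t^i) + \mathcal{N}_t^i$ for each agent
        Execute joint action $\mathbf{a}_t$ and observe joint reward $\mathbf{r}_t$ and new observation $\mathbf{o}_{t+1}$
        Store transition $(\mathbf{o}_t, \mathbf{a}_t, \mathbf{r}_t, \mathbf{o}_{t+1})$ in $\mathcal{D}$
    end for
    Sample a random minibatch of $M$ transitions from $\mathcal{D}$
    for each agent $i$ do
        Evaluate $\mathcal{L}^i$ and $J_{PG}^i$ from Equations (1) and (2)
        for each other agent $j$ ($j \neq i$) do
            Evaluate $J_{TS}^{i,j}$ from Equation (4)
            Update actor $j$ with $\theta^j \leftarrow \theta^j + \alpha_\theta \nabla_{\theta^j} \lambda_2 J_{TS}^{i,j}$
        end for
        Update critic with $\phi^i \leftarrow \phi^i - \alpha_\phi \nabla_{\phi^i} \mathcal{L}^i$
        Update actor $i$ with $\theta^i \leftarrow \theta^i + \alpha_\theta \nabla_{\theta^i} \left( J_{PG}^i + \lambda_1 \sum_{j=1}^{N} J_{TS}^{i,j} \right)$
    end for
    Update all target networks with $Q^i_{\text{target}} \leftarrow (1-\tau) Q^i_{\text{target}} + \tau Q^i$ and $\mu^i_{\text{target}} \leftarrow (1-\tau) \mu^i_{\text{target}} + \tau \mu^i$
end for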


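Algorithm: CoachMADDPG

Randomly initialize $N$ critic networks $Q^i$, actor networks $\mu^i$ and one coach network $p^c$
Initialize $N$ target networks $Q^i_{\text{target}}$ and $\mu^i_{\text{target}}$
Initialize one replay buffer $\mathcal{D}$
for episode from 1 to number of episodes do
    Initialize random processes $\mathcal{N}^i$ for action exploration
    Receive initial joint observation $\mathbf{o}_1$
    for timestep $t$ from 1 to episode length do
        Select action $a^i = \mu^i(o_t^i) + \mathcal{N}_t^i$ for each agent
        Execute joint action $\mathbf{a}_t$ and observe joint reward $\mathbf{r}_t$ and new observation $\mathbf{o}_{t+1}$
        Store transition $(\mathbf{o}_t, \mathbf{a}_t, \mathbf{r}_t, \mathbf{o}_{t+1})$ in $\mathcal{D}$
    end for
    Sample a random minibatch of $M$ transitions from $\mathcal{D}$
    for each agent $i$ do
        Evaluate $\mathcal{L}^i$, $J_{PG}^i$, $J_E^i$ and $J_{EPG}^i$ from Equations (1), (2), (7) and (6)
        Update critic with $\phi^i \leftarrow \phi^i - \alpha_\phi \nabla_{\phi^i} \mathcal{L}^i$
        Update actor with $\theta^i \leftarrow \theta^i + \alpha_\theta \nabla_{\theta^i} \left( J_{PG}^i + \lambda_2 J_{EPG}^i + \lambda_1 J_E^i \right)$
    end for
    Update coach with $\psi \leftarrow \psi + \alpha_\psi \nabla_{\psi} \frac{1}{N} \sum_{i=1}^{N} \left( J_{EPG}^i + J_E^i \right)$
    Update all target networks with $Q^i_{\text{target}} \leftarrow (1-\tau) Q^i_{\text{target}} + \tau Q^i$ and $\mu^i_{\text{target}} \leftarrow (1-\tau) \mu^i_{\text{target}} + \tau \mu^i$
end for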
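The final step of both algorithms, the soft update of the target networks, can be sketched on flat parameter lists. This is a minimal illustration: real networks hold tensors rather than Python floats, and `soft_update` is a hypothetical helper name.

```python
def soft_update(target_params, online_params, tau):
    """Return (1 - tau) * target + tau * online, parameter by parameter."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 1.0]
# With tau = 0.1 the first target parameter moves a tenth of the way to the online one.
print(soft_update(target, online, tau=0.1))
```

Small values of $\tau$ make the targets used in the critic loss change slowly, which is the usual motivation for this kind of update.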

8.3 Hyper-parameter search results

Figure 6 shows the raw results of the hyper-parameter searches that are summarized in the main text (see Figure 3).

Figure 6: Hyper-parameter tuning results for all algorithms. There is one distribution per (algorithm, environment) pair, each one formed of 50 points (hyperparameter configuration sample). Each point represents the performance over 100 evaluation episodes for each of the 3 training seeds at the end of training for one sampled hyper-parameters configuration (total of 300 performance values per sampled configuration).

8.4 Hyper-parameter search ranges

Table 1 shows the ranges in which values for the hyper-parameters were drawn uniformly during the searches, where $\eta_{noise}$ represents the initial exploration noise scale. In our experiments, the learning rates of the agents ($\alpha_\theta$) and the coach network ($\alpha_\psi$) are always equal, motivated by the similar architectures and learning signals, and in order to reduce the search space. For the critic networks, we aim for greater flexibility, while constraining their learning rate $\alpha_\phi$ to be related to $\alpha_\theta$ and $\alpha_\psi$. We thus use a critic learning rate coefficient $\omega_\phi$ so that $\alpha_\phi = \omega_\phi \cdot \alpha_\theta$.

Table 1: Ranges for hyper-parameter search

Hyper-parameter            Range
$\log(\alpha_\theta)$      [-8, -3]
$\log(\omega_\phi)$        [-2, 2]
$\log(\tau)$               [-3, -1]
$\log(\lambda_1)$          [-3, 0]
$\log(\lambda_2)$          [-3, 0]
$\eta_{noise}$             [0.3, 1.8]
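Drawing one configuration from the ranges of Table 1 could look like the sketch below. The logarithms are assumed to be base 10 (an assumption, though the selected values reported in Section 8.5 fall inside the resulting intervals), and the dictionary keys and the name `sample_config` are hypothetical.

```python
import random

# Ranges from Table 1. The first five are ranges on log10 of the value (assumed base).
LOG10_RANGES = {
    "actor_lr": (-8.0, -3.0),       # log(alpha_theta)
    "critic_lr_coef": (-2.0, 2.0),  # log(omega_phi)
    "tau": (-3.0, -1.0),
    "lambda_1": (-3.0, 0.0),
    "lambda_2": (-3.0, 0.0),
}
NOISE_RANGE = (0.3, 1.8)  # eta_noise is sampled directly, not in log space


def sample_config(rng):
    """Sample one hyper-parameter configuration uniformly within the ranges."""
    config = {name: 10.0 ** rng.uniform(lo, hi) for name, (lo, hi) in LOG10_RANGES.items()}
    config["noise_scale"] = rng.uniform(*NOISE_RANGE)
    # The critic learning rate is tied to the actor's through the coefficient.
    config["critic_lr"] = config["critic_lr_coef"] * config["actor_lr"]
    return config


# 50 configurations per search, as in the experimental setup.
configs = [sample_config(random.Random(seed)) for seed in range(50)]
print(configs[0])
```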

8.5 Selected hyper-parameters

Tables 2, 3, 4, and 5 show the best hyper-parameters found by the random searches for each of the environments and each of the algorithms.

Table 2: Best found hyper-parameters for the SPREAD environment

$\alpha_\theta$    5.3*10^-5   2.1*10^-5   9.0*10^-4   2.5*10^-5   2.4*10^-5
$\omega_\phi$      53          79          0.71        42          9.8
$\tau$             0.05        0.083       0.076       0.098       0.019
$\lambda_1$        -           -           -           0.054       0.21
$\lambda_2$        -           -           -           0.29        0.46
$\eta_{noise}$     1.0         0.5         0.7         1.2         1.2

Table 3: Best found hyper-parameters for the BOUNCE environment

$\alpha_\theta$    8.1*10^-4   3.8*10^-5   1.2*10^-4   1.3*10^-5   2.0*10^-5
$\omega_\phi$      2.4         87          0.47        85          55
$\tau$             0.089       0.016       0.06        0.055       0.042
$\lambda_1$        -           -           -           0.06        0.0027
$\lambda_2$        -           -           -           0.0026      0.0024
$\eta_{noise}$     1.2         0.9         1.2         1.0         1.4

Table 4: Best found hyper-parameters for the CHASE environment

$\alpha_\theta$    4.5*10^-4   2.0*10^-4   9.7*10^-4   1.3*10^-5   9.5*10^-4
$\omega_\phi$      32          64          0.79        85          7.3
$\tau$             0.031       0.021       0.032       0.055       0.068
$\lambda_1$        -           -           -           0.06        0.058
$\lambda_2$        -           -           -           0.0026      0.017
$\eta_{noise}$     0.6         1.0         1.5         1.0         1.2

Table 5: Best found hyper-parameters for the IMITATION environment

$\alpha_\theta$    4.4*10^-6   3.3*10^-4   9.2*10^-4   3.7*10^-4   2.0*10^-6
$\omega_\phi$      42          0.97        66          0.79        64
$\tau$             0.04        0.0092      0.015       0.041       0.049
$\lambda_1$        -           -           -           0.032       0.0044
$\lambda_2$        -           -           -           0.0013      0.0014
$\eta_{noise}$     1.6         1.1         1.0         0.4         1.8