Abstract
Multiagent reinforcement learning (MARL) is commonly considered to sufferfrom nonstationary environments and exponentially increasing policy space. Itwould be even more challenging when rewards are sparse and delayed over longtrajectories. In this paper, we study hierarchical deep MARL in cooperativemultiagent problems with sparse and delayed reward. With temporal abstraction,we decompose the problem into a hierarchy of different time scales andinvestigate how agents can learn highlevel coordination based on theindependent skills learned at the low level. Three hierarchical deep MARLarchitectures are proposed to learn hierarchical policies under different MARLparadigms. Besides, we propose a new experience replay mechanism to alleviatethe issue of the sparse transitions at the high level of abstraction and thenonstationarity of multiagent learning. We empirically demonstrate theeffectiveness of our approaches in two domains with extremely sparse feedback:(1) a variety of Multiagent Trash Collection tasks, and (2) a challengingonline mobile game, i.e., Fever Basketball Defense.
Quick Read (beta)
Hierarchical Deep Multiagent Reinforcement Learning with Temporal Abstraction
Abstract
Multiagent reinforcement learning (MARL) is commonly considered to suffer from nonstationary environments and exponentially increasing policy space. It would be even more challenging when rewards are sparse and delayed over long trajectories. In this paper, we study hierarchical deep MARL in cooperative multiagent problems with sparse and delayed reward. With temporal abstraction, we decompose the problem into a hierarchy of different time scales and investigate how agents can learn highlevel coordination based on the independent skills learned at the low level. Three hierarchical deep MARL architectures are proposed to learn hierarchical policies under different MARL paradigms. Besides, we propose a new experience replay mechanism to alleviate the issue of the sparse transitions at the high level of abstraction and the nonstationarity of multiagent learning. We empirically demonstrate the effectiveness of our approaches in two domains with extremely sparse feedback: (1) a variety of Multiagent Trash Collection tasks, and (2) a challenging online mobile game, i.e., Fever Basketball Defense.
Hierarchical Deep Multiagent Reinforcement Learning with Temporal Abstraction
Hongyao Tang^{1}, Jianye Hao^{1}, Tangjie Lv^{2}, Yingfeng Chen^{2}, Zongzhang Zhang^{3}, Hangtian Jia^{2}, Chunxu Ren^{2}, Yan Zheng^{1}, Zhaopeng Meng^{1}, Changjie Fan^{2}, Li Wang^{1} ^{1}College of Intelligence and Computing, Tianjin University, ^{2}Netease Fuxi AI Lab, ^{3}Nanjing University ^{1}{bluecontra,jianye.hao,yanzheng,mengzp,wangli}@tju.edu.cn, ^{2}{hzlvtangjie,chenyingfeng1,jiahangtian,renchunxu, fanchangjie}@corp.netease.com, ^{3}[email protected]
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
Deep Reinforcement Learning (DRL) has been applied to solve many challenging singleagent sequential decisionmaking problems in recent years levine2016end ; mnih2015human . However, many realworld tasks such as sensor networks zhang2013coordinating and autonomous cars Cao2013AnOO are naturally organized as multiagent systems (MASs), where multiple agents have cooperative or competitive interactions with others. In MASs, the problem complexity increases exponentially with the number of agents, thus making it infeasible to apply traditional RL approaches directly. Independent learning alleviates this issue but usually fails to coordinate since it suffers from the nonstationary environments due to multiagent policy updates.
To address such challenges, a large amount of works have studied deep MARL from a variety of perspectives. Foerster et al. foerster2017stabilising propose multiagent importance sampling and fingerprints to stabilize experience replay of independent Qleaning (IQL) tan1993multi . Peng et al. peng2017multiagent propose multiagent Bidirectionally Coordinated Network (BiCNet) to learn effective collaborative strategies from a centralized perspective in StarCraft Micromanagement tasks. Following the centralized training with decentralized execution paradigm Lowe2017MultiAgentAF , Counterfactual multiagent policy gradients foerster2017counterfactual are proposed to address the multiagent credit assignment issue. Besides, other approaches, e.g., CommNet sukhbaatar2016learning and DIAL foerster2016learning , are proposed to learn communication protocols among multiple cooperative agents.
Most previous works learn cooperative polices directly over primitive action spaces and usually perform well in environments with dense reward. However, in many realworld scenarios rewards are naturally sparse and immediate feedbacks are difficult to design manually. In such circumstances, it is challenging to achieve coordination and most plain MARL approaches can hardly learn effective policies from sparse and delayed feedback. To address this issue, one important direction is hierarchical MARL with temporal abstraction ghavamzadeh2006hierarchical , which was firstly proposed in tabular RL settings and is still neglected in deep MARL literature. One representative work in tabular settings is the multiagent MAXQ makar2001hierarchical . With temporal abstraction, the difficulties of learning with sparse reward are reduced by manually decomposing the original problem into multiple levels of abstractions. The results show that cooperative strategies can be efficiently learned at the high level based on the lowlevel skills, in simple Multiagent Trash Collection tasks and Automated Guided Vehicles (AGVs) scheduling tasks. However, the effectiveness of hierarchical MARL need to be further verified in more complex and practical problems with deep learning techniques.
In this paper, we study hierarchical MARL with deep neural networks in cooperative multiagent problems with sparse reward. We show that temporal abstraction can facilitate multiagent policy learning and leverage this to construct three hierarchical deep MARL architectures for different MARL learning paradigms. Besides, a new experience replay mechanism is introduced to further improve the coordination during the highlevel policy learning.
Key contributions of this work are summarized as follows:

•
To the best of our knowledge, we are the first to study hierarchical deep MARL with temporal abstraction. We model the problems with a twolevel hierarchy of abstraction, and study how cooperative policies and independent skills can be learned at different temporal scales together. To this end, we propose three hierarchical deep MARL architectures, i.e., hierarchical Independent Learner (hIL), hierarchical Communication network (hComm) and hierarchical Qmix network (hQmix), for different MARL paradigms.

•
We discuss two issues of hierarchical MARL, including the sparse transitions of highlevel learning and the nonstationarity of multiagent policy updates. To alleviate the above issues, we introduce Augmented Concurrent Experience Replay (ACER) through augmenting highlevel transitions with subtransitions and conducting experience replay concurrently.

•
We show the effectiveness of hierarchical deep MARL in a variety of extended classic Multiagent Trash Collection tasks. Moreover, our approaches achieve good performance in the challenging Fever Basketball Defense game, in which plain MARL approaches can hardly learn effective cooperative strategies due to sparse and delayed rewards.
2 Background
Markov Games
In this paper, we consider a multiagent extension of Markov decision processes (MDPs) called (partially observable) Markov games littman1994markov . A Markov game for $N$ agents is defined by a set of states $\mathcal{S}$, a set of primitive actions ${\{{\mathcal{A}}^{i}\}}_{i=0}^{N}$ and a set of observations ${\{{\mathcal{O}}^{i}\}}_{i=0}^{N}$ for each agent. Each agent $i$ receives a private observation ${o}^{i}:\mathcal{S}\to {\mathcal{O}}^{i}$ and selects actions through its policy ${\pi}^{i}:{\mathcal{O}}^{i}\times {\mathcal{A}}^{i}\to [0,1]$, producing the next state according to the state transition function $\mathcal{P}:\mathcal{S}\times {\mathcal{A}}^{1}\times \mathrm{\cdots}\times {\mathcal{A}}^{N}\times \mathcal{S}\to [0,1]$. Each agent $i$ obtains extrinsic rewards ${r}^{i}:\mathcal{S}\times {\mathcal{A}}^{1}\times \mathrm{\cdots}\times {\mathcal{A}}^{N}\to \mathbb{R}$ from the environment. Each agent $i$ aims to maximize its own total expected return ${\sum}_{t=0}^{T}{\gamma}^{t}{r}_{t}^{i}$, where $\gamma $ is a discount factor and $T$ is the time horizon. Especially, a Markov game is fullycooperative when all agents share a joint utility function, i.e., the reward function is identical for all agents.
Temporal Abstraction
Sparse and delayed feedback is a significant challenge in learning effective policies. In order to tackle this, we introduce temporal abstraction. Temporal abstraction is the key in solving longperiod tasks, since it reduces the difficulties of problem through allowing agents to ignore the details that are irrelevant for the task at hand Fikes1972learning . This is commonly seen in the context of Hierarchical Reinforcement Learning (HRL) kulkarni2016hierarchical ; Nachum2018Data ; parr1998reinforcement , where the tasks are usually decomposed into multilevel hierarchies and the agent operates and learns simultaneously at multiple levels of abstraction. Following the multiagent temporal abstraction literature ghavamzadeh2004learning ; makar2001hierarchical ; mehta2005multi , we consider a twolevel hierarchical model in a Markov game. As shown in Figure 1, each agent sets (multistep) intrinsic goals at the high level and a sequence of primitive actions are made at the low level to achieve the certain goal. We assume that the intrinsic goals are temporal abstraction of the simple tasks or skills that can be achieved or accomplished independently. This reduces the difficulties of the original problem while maintains the multiagent properties, thus agents can learn to cooperate and coordinate with others at the high level.
Hierarchical MARL
With multiagent temporal abstraction, we introduce hierarchical MARL as illustrated in 1. The high level of hierarchy can be modeled as a SemiMarkov game, similar to the Multiagent SemiMDP (MSMDP) ghavamzadeh2006hierarchical , since intrinsic goals may last for multiple time steps. Formally, each agent $i$ receives observation ${o}_{t}^{i}$ and chooses a goal ${g}_{t}^{i}\in {\mathcal{G}}^{i}$, where ${\mathcal{G}}^{i}$ denotes the set of all possible intrinsic goals. A new goal ${g}_{t+\tau}^{i}$ is selected until the current goal ${g}_{t}^{i}$ is achieved or terminated after $\tau $ steps of lowlevel executions. The next state ${s}_{t+\tau}$ is determined by the multistep transition function $\stackrel{~}{\mathcal{P}}:\mathcal{S}\times {\mathcal{A}}^{1}\times \mathrm{\cdots}\times {\mathcal{A}}^{N}\times \mathbb{N}\times \mathcal{S}\to [0,1]$, where $\mathbb{N}$ is the set of natural numbers. The objective of agent $i$’s highlevel policy ${\pi}^{i}$ is to maximize the cumulative extrinsic reward.
In contrast to the high level, we model the low level of hierarchy as MDPs. Agent $i$ receives the intrinsic observation ${\varphi}_{t}^{i}$ which depends on observation ${o}_{t}^{i}$ and current goal ${g}_{t}^{i}$, and chooses an action ${a}_{t}^{i}\in {\mathcal{A}}_{g}^{i}$, where ${\mathcal{A}}_{g}^{i}$ ($\subseteq {\mathcal{A}}^{i}$) is a set of all available primitive actions under the current goal. The intrinsic observation can be viewed as a form of state abstraction ghavamzadeh2006hierarchical which excludes the state features that are irrelevant with the current goal. Given an intrinsic goal, agent $i$ learns a lowlevel policy ${\pi}_{g}^{i}$ via optimizing the cumulative intrinsic reward. For example, the intrinsic reward ${\widehat{r}}_{t}^{i}$ can be $1$ if the goal is reached and $0$ otherwise kulkarni2016hierarchical .
2.1 Termination Models
It should be noted that the intrinsic goals of multiple agents may not terminate at the same time, thus inducing two termination models in hierarchical MARL, i.e., synchronous and asynchronous models ghavamzadeh2006hierarchical . In synchronous models, agents make decisions at the same time points and this can be seen in many scenarios where synchronization mechanisms or conventions exist. In such cases, it is suitable for highlevel communication and centralized learning if possible. However, synchronous models may result in policy suboptimality since agents have to wait or early terminate their lowlevel executions to meet the synchronization requirements ghavamzadeh2006hierarchical . This can be even severe with the increase of agents. In contrast, asynchronous termination models need no extra synchronization and decision making can take place in a fully decentralized fashion. This allows agents to learn more flexible policies while may cause potential difficulties in achieving coordination.
3 Approaches
In this section, based on the above modeling, we propose three architectures for hierarchical deep MARL and a new experience replay mechanism to facilitate highlevel learning.
3.1 Hierarchical Deep MARL Architectures
Hierarchical Independent Learner
The most straightforward way is to let each agent learn its hierarchical policies independently, inducing the hierarchical Independent Learner (hIL). As illustrated in Figure 2, each hIL agent updates its hierarchical policies independently and no gradient is propagated between highlevel and lowlevel policies. hIL can be implemented with either valuebased or policybased algorithms LillicrapHPHETS15continuous ; SchulmanWDRK17proximal ; van2016deep . In our experiments, we implement the hIL architecture with DQNs mnih2015human . With SemiMDP QLearning sutton1999between , we can train the highlevel policy $\pi $ parameterized by $\theta $ via minimizing the loss with minibatch samples from replay buffer $\mathcal{D}$,
$$\mathcal{L}(\theta )={\mathbb{E}}_{({s}_{t},{g}_{t},\tau ,{r}_{t:t+\tau 1},{s}_{t+\tau})\sim \mathcal{D}}\left[{\left({y}_{t}Q({s}_{t},{g}_{t};\theta )\right)}^{2}\right],$$  (1) 
where ${y}_{t}={\sum}_{k=0}^{\tau 1}{\gamma}^{k}{r}_{t+k}+{\gamma}^{\tau}{\mathrm{max}}_{{g}_{t+\tau}}Q({s}_{t+\tau},{g}_{t+\tau};{\theta}^{})$. Here ${\theta}^{}$ denotes the parameters of the target value function and superscripts of agents are omitted for clarity.
The hIL architecture is compatible with both synchronous and asynchronous termination models. In independent learning paradigm, each agent only cares about its local information and treats other agents as part of the environment. However, it does not make use of available global information, which can be significant for facilitating coordination among cooperative agents. To this end, we use hIL as the base architecture and we derive the following two architectures from hIL.
Hierarchical Communication Network
In the scenarios where each agent can have access to other agents’ information occasionally, we propose hierarchical Communication network (hComm) which allows agents to learn sparse highlevel communication to facilitate cooperation. Inspired by commNet sukhbaatar2016learning , which was originally designed to learn temporal communication along consecutive steps, hComm learns spatial communication between hidden layers among multiple agents. As illustrated in Figure 2, agent $i$ participates in the communication by feeding its hidden state ${h}_{t}^{i}$ into the Comm module and takes the processed output ${c}_{t}^{i}$ as additional inputs of the next layer. One simple way to implement the Comm module is to average over the hidden states of other agents, i.e.,
$${c}_{t}^{i}=\frac{1}{N1}\sum _{k\ne i}{h}_{t}^{k}.$$  (2) 
Intuitively, this allows each agent to explicitly reason about the information of others when choosing an intrinsic goal. Note that hComm naturally requires agents to learn and execute highlevel policies centrally. However, it is not necessary for agents to choose goals in synchronization and thus hComm is also compatible with both synchronous and asynchronous termination models.
Hierarchical Qmix Network
Since centralized control during online execution is usually difficult to maintain in practice, another popular way to facilitate coordination is to learn decentralized policies with centralized training. Following this paradigm, we propose hierarchical Qmix network (hQmix) which leverages the idea of the Qmix architecture rashid2018qmix to coordinate the highlevel policy updates during training. As illustrated in Figure 2, a feedforward mixing network $\mathcal{M}$ (green) takes highlevel action values ${\{{Q}^{i}\}}_{i=0}^{N}$ as inputs and mixes them monotonically, producing the joint value ${Q}^{tot}$:
${Q}^{tot}({\overrightarrow{o}}_{t},{\overrightarrow{g}}_{t})=$  $\mathcal{M}({Q}^{1}({o}_{t}^{1},{g}_{t}^{1}),\mathrm{\dots},{Q}^{N}({o}_{t}^{N},{g}_{t}^{N})),$  (3)  
$\frac{\partial {Q}^{tot}}{\partial {Q}^{i}}}\ge 0,\forall i\in N,$ 
where ${\overrightarrow{o}}_{t}$, ${\overrightarrow{g}}_{t}$ are the joint observation and intrinsic goal of agents. This ensures that a global argmax operation performed on ${Q}^{tot}$ yields the same result as a set of individual argmax performed on individual ${Q}^{i}$. The weights and biases of mixing network are produced by separate hypernetworks ha2017hypernetworks (the red part of Figure 2), which takes state ${s}_{t}$ as input. To enforce the monotonicity constraint in Equation 3, the weights are restricted to be nonnegative with an absolute activation function.
The hQmix architecture allows agents to train their highlevel policies via updating a joint actionvalue function. This is naturally compatible for synchronous termination model since ${Q}^{tot}$ is estimated over joint intrinsic goals. For asynchronous case, an additional trim may be necessary to align the highlevel transitions. An additional discussion can be found in the Supplementary Material.
Lowlevel Parameter Sharing
One advantage of our hierarchical model is that lowlevel policies are relatively independent and can be reused among agents. Therefore, we use parameter sharing in lowlevel learning from two aspects: 1) we share parameters across the lowlevel policy networks of an agent that have similar input and output formats, e.g., shooting and passing a football both use the move and kick actions; 2) lowlevel policy parameters are shared among cooperative agents. For specialization, the lowlevel policy takes a goal index and an agent identifier as additional inputs.
3.2 Augmented Concurrent Experience Replay
Experience replay is important for offpolicy learning to improve the crucial sample efficiency. However, in hierarchical MARL, experience replay can be flawed in two aspects. First, the highlevel transitions start and terminate at sparse time points. It causes inefficient updates of highlevel policy since the intermediate states (subtransitions) are not utilized. Second, highlevel experiences can be obsolete and misleading due to the nonstationary environment when replayed independently. To address the above two issues, we propose an experience replay mechanism named Augmented Concurrent Experience Replay (ACER), consisting of two components: experience augmentation and concurrent sampling.
As illustrated in Figure 3, we augmented the highlevel transition $\u27e8{o}_{t}^{i},{g}_{t}^{i},{r}_{t:t+\tau 1}^{i},{o}_{t+\tau}^{i}\u27e9$ with subtransitions (dashed braces), producing an augmented experience ${\{\u27e8{o}_{t+k}^{i},{g}_{t}^{i},{r}_{t+k:t+\tau 1}^{i},{o}_{t+\tau}^{i}\u27e9\}}_{k=0}^{\tau 1}$. This enables agents to learn highlevel policies more efficiently from denser experiences, since each state encountered between consecutive goals can be updated. In practice, to reduce the storage burden, we can conduct experience augmentation with selective subtransitions, e.g., interpolate at intervals.
Besides, to deal with the nonstationarity of highlevel policy learning, we adopt the idea of concurrent sampling omidshafiei2017deep in ACER. As demonstrated in Figure 3, in each episode, augmented experiences of agents are stored in sequence along the timestep axis, which ensures concurrent experiences are stored at the same position in the buffer. Instead of sampling experiences independently, ACER samples piles of concurrent experiences for all agents. Concurrent sampling induces correlations in policy updates of agents, thus encouraging agents towards coordinated policies and stabilizing the training process as well.
4 Experiments
In this section, we firstly demonstrate the effectiveness of hierarchical deep MARL in extended Multiagent Trash Collection tasks, and then evaluate our approaches in the challenging team sports game, i.e., Fever Basketball Defense.
4.1 Multiagent Trash Collection (MATC)
We devise a range of MATC tasks, by extending the classic game in makar2001hierarchical . In all three tasks of Figure 7, two agents (blue and red) are assigned to collect the cans (green) and dump them into trash bins (yellow) in a gridworld with impenetrable walls (grey). The observation of each agent is a imagelike tensor of multiple channels that record the views for different objects (i.e., agents, cans, trash bins and walls) in the environments. Each agent has 7 actions, including navigation actions (i.e., up, down, left, right and noop) and operation actions (i.e., pickup and putdown). For MATCRoom and MATCRing, agents obtain a reward of $0.5$ when a can is dumped into a trash bin. For MATCCoordination, agents receive a reward of $1/0.5$ if both of them dump the can into the same (upper/lower) trash bin. Otherwise, they obtain a reward of $0.1$ for dumping into different trash bins. The three tasks have a max horizon of $100/100/50$ steps respectively.
Temporal Abstraction
For all three tasks, we abstract the task into two onestep operation goals (i.e., pickup, putdown) and several navigation goals that are built over navigation actions. For example, the navigation goals of MATCRoom are movetocan1, movetocan2, movetotrashbin1 and movetotrashbin2. Each navigation goal has a max duration of 15 time steps and can end when the goal is reached. This induces an asynchronous termination model. Two agents share the same task decomposition. The lowlevel intrinsic observation is a concatenation of the relevant channels, i.e., agents, walls and the target object associated with the current goal. We use a binary intrinsic reward function for lowlevel policy learning, which gives 1 for reaching the goal and $0.01$ otherwise.
Result Analysis
To show the effectiveness of hierarchical deep MARL, we compare the hIL architecture with independent DQN (ILDQN), which can be viewed as a nonhierarchical version of hIL. The results are shown in Figure 4(b)4(d). In all three tasks, ILDQN can hardly achieve any positive reward, while hIL successfully accomplishes the tasks and achieves nearoptimal performance especially in MATCRoom and MATCCoordination. This indicates that it is difficult to learn coordinated behaviors over primitive actions with such sparse reward. Actually, we also apply other advanced techniques for ILDQN (e.g., prioritized experience replay schaul2015prioritized and SimHash exploration TangHFSCDSTA17Exploration ) while no apparent improvement is observed. In contrast, with hierarchical policy learning, intrinsic goals can be easily mastered at the low level. Further, cooperative strategies can be learned efficiently over the intrinsic goals at a large time scale. The results for hComm and hQmix are also omitted since hIL has already achieved nearoptimal performance in such simple domain. Experimental details are provided in the Supplementary Material.
4.2 Fever Basketball Defense (FBD)
Fever Basketball is a popular online mobile game that mimics the realworld street basketball scenarios. FBD is a miniscenario of Fever Basketball in which three defense players with different roles, i.e., Center (C), Small Forward (SF) and Point Guard (PG), compete against the offense team. A screenshot of FBD is shown in Figure 5. The agents (defense players) aim to learn cooperative defensive strategies to prevent the offense team from scoring. The offense players use a builtin policy, which is much more powerful than average human game players.
4.2.1 Environment Setup
Each episode renders an offense round with a max duration of 20 seconds. A reward of $1/+1$ is given at the end of each episode, depending on whether the offense team scores or not. The state is a feature vector consisting of the remaining time of the episode, the positions and types of the entities (i.e., teammates, opponents and the ball) in the game field. Each agent has 18 available actions, including noop, 8 move actions (i.e., move[in 8 directions]), and 9 defense actions (i.e., defense and defensemove[in 8 directions]). The defense actions, that have lower movement speed, can decrease the hit rate of the offense player nearby. Besides, a special builtin defensive behaviors, i.e., block, can be triggered occasionally when the defense agent is in a good defense position. Therefore, the purpose of the agents is to keep good defense positions and prevent the offense team from scoring.
Temporal Abstraction
As illustrated in Figure 9, we decompose the game into a twolevel hierarchy with 6 intrinsic goals (i.e., runto[C/SF/PG] and closedefense[C/SF/PG]) for all three agents, each of which has a set of 9 primitive actions. Each goal is executed for up to 15 steps and early terminations can only happen when an episode ends or the offense players pass the ball. This induces a synchronous termination model since agents alter their goals concurrently (the asynchronous setting is discussed in Section 4.2.3). An agent’s lowlevel intrinsic observation contains the only the information about itself and the target offense player with the specific role associated to the current goal. To train lowlevel policies, we use a binary intrinsic reward function for all goals. A reward of $+2$/$0.01$ is given depending on whether the agent is within the Effective Defense Area, i.e., a sector area in front of the offensive agent. For the closedefense goals, an extra bonus of $+2$ is given if an agent successfully blocks the ball. Complete description can be found in the Supplementary Material.
4.2.2 Evaluations
We evaluate our approaches with two criteria: the missshot rate of the offensive team and the blockshot rate of the defensive team over recent 100 episodes. We firstly conduct experiments with our hierarchical deep MARL architectures and two benchmark approaches, i.e., ILDQN and LowLevelOnly. LowLevelOnly is a variant of hIL that trains lowlevel policies only and leaves highlevel policy random. We observe that CommNet and Qmix show no better performance than ILDQN and thus are omitted for clarity. We suggest that the primary challenge for plain deep MARL approaches is the ineffective policy learning due to the extremely sparse reward.
Figure 6 shows the results for different approaches. First, the significance of temporal abstraction can be demonstrated by hIL’s superior performance over ILDQN. Besides, LowLevelOnly also shows certain defense performance even with a random highlevel policy. This is because that actions made by welllearned lowlevel policies still hinder the offense team. This indicates that hIL takes a twostep improvement via learning at different time scales. Lowlevel learning is the cornerstone for effective hierarchical learning. With highquality lowlevel policies, agents can learn defense strategies efficiently at the high level, thus further elevate the defense performance.
Furthermore, hComm and hQmix achieve comparative results and both of them clearly outperform the base architecture hIL. For hComm, we credit the advantage to the highlevel communication among agents. With explicitly taking others’ information into consideration, agents can make more informed decisions and update their policies in a more coordinated manner. Thus, in contrast to hIL, it is more likely for cooperative policies to emerge with the hComm architecture. For hQmix, a joint actionvalue function is trained centrally with regard to the joint rewards. This provides individual agents with more accurate update signals while hIL suffers from the misleading of the joint reward. Interestingly, we observe that hComm and hQmix show different defense tactics. hComm prefers joint defense and rapid shifts, while hQmix tends to onetoone defense. This explains the results of hComm and hQmix in Figure 6. Joint defense strategy are more aggressive thus results in a higher blockshot rate, while may leave some offense player unguarded. Onetoone defense ensures good defense quality and leads to a comparative missshot rate in spite of a lower block rate. The demonstration videos can be seen in the Supplementary Material.
Methods  Blockshot Rate  Missshot Rate 

hIL  $0.27$  $0.50$ 
hILACER  $0.36$  $0.55$ 
hComm  $0.34$  $0.55$ 
hCommACER  $0.37$  $0.58$ 
Next we move to the evaluation of ACER, as shown in Table 1. The results show that ACER improves the defensive performance for both hIL and hComm. hILACER shows an significant advantage over hIL and achieves comparable results with hComm. In contrast, hCommACER shows a slight improvement for hComm and we hypothesize that it is due to the stability nature of communication and centralized learning. This indicates that ACER is effective for stabilizing experience replay and improving highlevel policy learning, especially when agents learn their policies independently. Note that we do not apply ACER on hQmix since hQmix naturally ensures concurrent sampling by updating policies with the joint experiences of agents.
4.2.3 Asynchronous Termination Setting
We also performed experiments with an asynchronous termination model by allowing agents to choose a new goal as long as it reaches/leaves the Effective Defense Area during the execution of runto/closedefense goals. We found similar results as the synchronous case but with a $3\%$ to $5\%$ performance decay. We hypothesize that it is due to the nonstationary nature of asynchronous model. Complete results are omitted due to space limitation and can be found in the Supplementary Material.
5 Conclusion
To the best of our knowledge, we are the first to study hierarchical deep MARL with temporal abstraction. In order to learn hierarchical policies efficiently with sparse rewards, we propose three architectures, i.e., hIL, hComm and hQmix, and ACER. Our experimental results show that, with our approaches, agents can learn policies of different time scales together and coordination can be effectively achieved at a large time scale based on the skills mastered at the low level. Future work will investigate the specialization for asynchronous hierarchical MARL, the extension to competitive domains, the automatic task decomposition of multiagent problems.
References
 [1] Y. Cao, W. Yu, W. Ren, and G. Chen. An overview of recent progress in the study of distributed multiagent coordination. IEEE Transactions on Industrial Informatics, 9:427–438, 2013.
 [2] R. Fikes, P. E. Hart, and N. J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 3(13):251–288, 1972.
 [3] J. N. Foerster, Y. M. Assael, N. Freitas, and S. Whiteson. Learning to communicate with deep multiagent reinforcement learning. In NIPS, pages 2137–2145, 2016.
 [4] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multiagent policy gradients. In AAAI, pages 1–8, 2018.
 [5] J. N. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, and S. Whiteson. Stabilising experience replay for deep multiagent reinforcement learning. In ICML, pages 1146–1155, 2017.
 [6] M. Ghavamzadeh and S. Mahadevan. Learning to communicate and act using hierarchical reinforcement learning. In AAMAS, pages 1114–1121, 2004.
 [7] M. Ghavamzadeh, S. Mahadevan, and R. Makar. Hierarchical multiagent reinforcement learning. Journal of Autonomous Agents and MultiAgent Systems, 13(2):197–229, 2006.
 [8] D. Ha, A. M. Dai, and Q. V. Le. Hypernetworks. In ICLR, pages 1–8, 2017.
 [9] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In NIPS, pages 3675–3683, 2016.
 [10] S. Levine, C. Finn, T. Darrell, and P. Abbeel. Endtoend training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
 [11] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
 [12] M. L. Littman. Markov games as a framework for multiagent reinforcement learning. In ICML, pages 157–163. 1994.
 [13] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multiagent actorcritic for mixed cooperativecompetitive environments. In NIPS, pages 6382–6393, 2017.
 [14] R. Makar, S. Mahadevan, and M. Ghavamzadeh. Hierarchical multiagent reinforcement learning. In ICAA, pages 246–253, 2001.
 [15] N. Mehta and P. Tadepalli. Multiagent shared hierarchy reinforcement learning. In ICML Workshop on Rich Representations for Reinforcement Learning, pages 45–50, 2005.
 [16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [17] O. Nachum, S. Gu, H. Lee, and S. Levine. Dataefficient hierarchical reinforcement learning. CoRR, abs/1805.08296, 2018.
 [18] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multitask multiagent reinforcement learning under partial observability. In ICML, pages 2681–2690, 2017.
 [19] R. Parr and S. J. Russell. Reinforcement learning with hierarchies of machines. In NIPS, pages 1043–1049, 1997.
 [20] P. Peng, Y. Wen, Y. Yang, Q. Yuan, Z. Tang, H. Long, and J. Wang. Multiagent bidirectionallycoordinated nets: Emergence of humanlevel coordination in learning to play starcraft combat games. CoRR, abs/1703.10069, 2017.
 [21] T. Rashid, M. Samvelyan, C. Schröder de Witt, G. Farquhar, J. N. Foerster, and S. Whiteson. QMIX: monotonic value function factorisation for deep multiagent reinforcement learning. In ICML, pages 4292–4301, 2018.
 [22] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2015.
 [23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 [24] S. Sukhbaatar, A. Szlam, and R. Fergus. Learning multiagent communication with backpropagation. In NIPS, pages 2244–2252, 2016.
 [25] R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and semiMDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(12):181–211, 1999.
 [26] M. Tan. Multiagent reinforcement learning: Independent vs. cooperative agents. In ICML, pages 330–337, 1993.
 [27] H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel. #Exploration: A study of countbased exploration for deep reinforcement learning. In NIPS, pages 2750–2759, 2017.
 [28] H. v. Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Qlearning. In AAAI, pages 2094–2100, 2016.
 [29] C. Zhang and V. R. Lesser. Coordinating multiagent reinforcement learning with limited communication. In AAMAS, pages 1101–1108, 2013.
A. Multiagent Trash Collection (MATC)
A.1. Environment Introduction
For all three domains in Figure 1, the two agents (blue and red squares) are assigned to collect the cans (green squares) and dump them into trash bin (yellow square) with impermeable walls (grey squares). Each agent has 7 actions, i.e., up, down, left, right, pickup, putdown and noop. All cans and trash bins are equivalent and 7 actions are always available, which means that agents can pick up and put down cans at any time. Nothing happens if an illegal action is chosen, e.g., walk into walls, pick up or put down nothing, and pick up when currently possessing.
A.2. Temporal Abstraction
One example of temporal abstraction for MATCRoom is shown in Figure 8. Agents can alter their goals only when the current goal is achieved or a maximum goal length is reached. In our experiments, we set the max goal length to be $15$ steps. The max episode horizon is set to $100/100/50$ steps for MATCRoom, MATCRing and MATCCoordination tasks respectively.
A.3. Experimental Details and Hyperparameters
For hIL, the highlevel and lowlevel $Q$networks are all 2layer MultiLayer Perceptrons (MLPs) with $64/64$ units. The learning rates are $0.00025/0.0005$ and the discount factors are $0.95/0.9$ for highlevel and lowlevel learning. The memory buffer sizes are $5000/5000$ for both high level and low level. At the beginning of training, we keep highlevel policy random and update lowlevel policies for $50000$ times. After the warmup period, agents learn their highlevel policies and lowlevel policies together. For ILDQN, the Qnetworks are 2layer MLPs with $128/128$ units. The learning rates are $0.00025$ and the discount factors are $0.99$. The memory buffer sizes are $10000$.
We conduct network updates every $20$ steps. For hIL, the exploration rates are annealed from $1$ to $0.1$ in $10000/100000$ updates for lowlevel/highlevel learning. For ILDQN, the exploration rates are annealed from $1$ to $0.1$ in $100000$ updates. In MATC experiments, we use parameter sharing for both highlevel and lowlevel policies.
B. Fever Basketball Defense (FBD)
B.1. Task Decomposition
An illustration of decomposition for Fever Basketball Defense is shown in Figure 9.
B.2. State and Intrinsic Observation
As illustrated in Figure 9, the position of an unit, i.e., defense/offense players or the ball, is a tuple of $\u27e8d,\mathrm{cos}(\theta ),\mathrm{sin}(\theta )\u27e9$, where $d$ is the distance from the player to the basket rim and $\theta $ is the angle from the horizontal axis and the line between the player and the basket rim.
The highlevel learner and lowlevel learner receive different input features. The state is $50$dimension vector of features, consisting of the remaining time of the episode, the position of the ball, the types and positions of itself, teammates and opponents. With state abstraction, the intrinsic observation at the low level is a $26$dimension vector of features consisting of the types and positions of itself and the defense target of the current goal, e.g., the defense target is C if the current goal is runtoC.
B.3. Effective Defense Area
The effective defense area is defined as a sector area when an agent has a defense target. For example, if player $2$ is the current defense target of player $1$, we say that player $1$ is in the effective defense area when the distance between them is less than $1.5$ meters and the difference of their angles to the basket rim is less than $7.5$ degrees, i.e., $$.
B.4. Experimental Details and Hyperparameters
For hIL, the highlevel and lowlevel Qnetworks are all 2layer MLPs with $128/64$ and $64/32$ units respectively. The learning rates are $0.00025/0.00025$ and the discount factors are $0.95/0.9$ for highlevel and lowlevel learning. The memory buffer sizes are $20000$ for both high level and low level. We build our hComm and hQmix architecture based on the hIL architecture with the same hyperparameters. For hComm, we implement the Comm module between two hidden layers of the highlevel networks. For hQmix, we use the same network design for mixing networks and hypernetworks as the original paper.
We update networks every $1/2$ steps for lowlevel and highlevel learning respectively. The exploration rates are annealed from $1/1$ to $0.01/0.1$ in $100000/20000$ updates for lowlevel and highlevel learning respectively. In FBD experiments, we use parameter sharing for lowlevel networks only. At the beginning of training, we keep highlevel policy random and update lowlevel policies for $200000$ times. After the warmup period, agents learn their highlevel policies and lowlevel policies together.
C. Additional Discussion
C.1. Experience Trim for hQmix with Asynchronous Termination Model
Since the highlevel learning of hQmix follows the centralized training and decentralized execution paradigm, the hQmix architecture may not be directly applicable in the asynchronous termination setting.
Concretely, consider the update process of agents’ highlevel $Q$networks with hQmix:
${Q}^{i}({\overrightarrow{o}}_{t},{\overrightarrow{g}}_{t})={Q}^{i}({\overrightarrow{o}}_{t},{g}_{t}^{1},\mathrm{\dots},{g}_{t}^{N})\leftarrow $  $(1\alpha ){Q}^{i}({\overrightarrow{o}}_{t},{g}_{t}^{1},\mathrm{\dots},{g}_{t}^{N})$  (4)  
$+\alpha \left[{\displaystyle \sum _{k=0}^{\tau 1}}{\gamma}^{k}{r}_{t+k}^{i}+{\gamma}^{\tau}\underset{{g}_{t+\tau}^{i}}{\mathrm{max}}{Q}^{i}({\overrightarrow{o}}_{t+\tau},{g}_{t+\tau}^{1},\mathrm{\dots},{g}_{t+\tau}^{N})\right].$ 
Since multiple agents may choose and terminate their highlevel decisions (i.e., intrinsic goals) at different time points, this induces that other agents could make several intrinsic goals during the execution of ${g}_{t}^{i}$.
For example, agent 1 chooses an intrinsic goal at step 0, and terminates the goal at step 9. Meanwhile, agent 2 also chooses an intrinsic goal at step 0 and terminates the goal at step 5, then chooses another goal and terminates the second goal at step 9. Thus, there are one highlevel transition for agent 1 and two highlevel transitions for agent 2, resulting in an invalid update for $({\overrightarrow{o}}_{0},{g}_{0}^{1},{g}_{0}^{2},{r}_{0:8},{\overrightarrow{o}}_{9})$ with Equation 4.
One possible approach is the experience trim, e.g., trim the original transition of agent 1 and agent 2 into $({\overrightarrow{o}}_{0},{g}_{0}^{1},{g}_{0}^{2},{r}_{0:4},{\overrightarrow{o}}_{5})$ and $({\overrightarrow{o}}_{5},{g}_{0}^{1},{g}_{5}^{2},{r}_{5:8},{\overrightarrow{o}}_{9})$. This aligns the experiences through cut the transition of agent 1 into two segments. Now the update of $Q$functions can be carried out with Equation 4.
D. Complete Results
Synchronous  Asynchronous  
Approach  Blockshot Rate  Missshot Rate  Blockshot Rate  Missshot Rate 
ILDQN  $0.02$  $0.25$  $0.02$  $0.25$ 
LowLevelOnly  $0.16$  $0.43$  $0.16$  $0.43$ 
hIL  $0.27$  $0.50$  $0.24$  $0.48$ 
hComm  $0.34$  $0.55$  $0.30$  $0.53$ 
hQmix  $0.30$  $0.55$  N/A  N/A 
hILACER  $0.36$  $0.55$  0.31  $0.52$ 
hCommACER  0.37  0.58  0.31  0.54 
E. Demonstration Videos
We provide two demonstration videos for hQmix and hComm respectively, i.e., hQmixdemo.mp4 and hCommdemo.mp4.
In the videos, our agents always control the team that is currently doing defense, i.e., control team A if A is doing defense and control team B when team A gets the ball and team B starts to defense.
We can observe that hComm and hQmix learns different defense strategies. hQmix agents prefer to onetoone defense and almost always keep good defense positions. Besides, local cooperative defense can also been observed which means team spirits emerge among hQmix agents. The behaviors are quite reasonable as commonly seen in realworld basketball matches. In contrast, hComm learns relatively aggressive strategies. The agents show rapid shifts in defense target and joint defense behaviors occur a lot. In two aspects, this improves the quality of local defense, thus resulting in a higher blockshot rate, but may cause the issue of leaving some offense player unguarded.