Learning Hierarchical Teaching in Cooperative Multiagent Reinforcement Learning
Abstract
In cooperative multiagent reinforcement learning, agents commonly acquire heterogeneous knowledge. Learning across the team can be greatly improved if agents can effectively share their knowledge with one another. In particular, recent work showed that action advising, a form of peer-to-peer knowledge transfer from teacher agents to student agents, improves team-wide learning. However, that prior work considered advising only with primitive (low-level) actions, which limits scalability. This paper introduces a novel learning-to-teach framework, called hierarchical multiagent teaching (HMAT), in which the teacher’s advice may include extended action sequences over multiple levels of temporal abstraction. Empirical evaluations show that HMAT accelerates team-wide learning progress in environments more complex than those considered in previous learning-to-teach research. HMAT is also shown to learn teaching policies that can be transferred to different teammates/tasks, even when teammates have heterogeneous action spaces.
Dong-Ki Kim^{1,2}, Miao Liu^{2,3}, Shayegan Omidshafiei^{1,2,4}, Sebastian Lopez-Cot^{1,2}, Matthew Riemer^{2,3}, Golnaz Habibi^{1,2}, Gerald Tesauro^{2,3}, Sami Mourad^{2,3}, Murray Campbell^{2,3}, Jonathan P. How^{1,2}
^{1} LIDS, MIT; ^{2} MIT-IBM Watson AI Lab; ^{3} IBM Research; ^{4} Work was completed while at MIT
Preprint. Under review.
1 Introduction
In cooperative multiagent reinforcement learning (MARL), agents commonly develop their own knowledge of a domain since they have unique experiences. The history of human social groups provides evidence that the collective intelligence of multiagent populations may be greatly boosted if agents share their learned behaviors with others [26]. With this motivation in mind, we explore new methodologies for allowing multiagent populations of deep RL agents to effectively share their knowledge and learn from other agents while maximizing collective reward.
Recently proposed frameworks allow for various types of knowledge transfer between agents [30]. In this paper, we focus on transfer based on action advising, where an experienced “teacher” agent helps a less experienced “student” agent by suggesting which action to take next. Action advising allows a student to directly execute suggested actions without incurring much computational overhead. Recent work on action advising includes the Learning to Coordinate and Teach Reinforcement (LeCTR) framework [23], in which agents learn when and what actions to advise. While LeCTR improves upon prior advising methods based on heuristics [1, 4, 5, 31], it faces limitations in scaling to more complicated tasks with high-dimensional state-action spaces, long time horizons, and delayed rewards. The key difficulty is teacher credit assignment [23]: learning teacher policies requires estimates of the impact of each piece of advice on the student agent’s learning progress, but these estimates are difficult to obtain. A simple function approximator, such as tile coding, has been used in LeCTR to simplify the learning of teaching policies, but that approach does not scale well.
This paper proposes a new learning-to-teach framework, hierarchical multiagent teaching (HMAT). Specifically, scalability is improved by representing student policies using nonlinear function approximators (e.g., deep neural networks (DNNs)) and hierarchical reinforcement learning (HRL) [29, 14, 21], which allows advising temporally extended sequences of primitive actions. Credit assignment remains a significant issue because DNNs use minibatches to stabilize learning [11], and minibatches can be randomly selected from a replay memory [20]. Hence, the student’s learning progress is affected by a batch of advice suggested at varying times, and identifying the extent to which each piece of advice contributes to the student’s learning is very challenging. Additional challenges include handling large state-action spaces, long time horizons, and delayed rewards.
The main contribution of this work is a method to resolve teacher credit assignment with deep student policies: a sequence of advice (i.e., a subgoal plan), instead of a single piece of advice, is used to update the student’s policy, and a temporary student policy is used to estimate the teacher’s contribution to the student’s learning progress. The second contribution is learning (via HRL) high-level teacher policies that advise students to take high-level actions (e.g., subgoals). Empirical evaluations demonstrate multiple advantages of HMAT. First, HMAT accelerates team-wide learning progress for complex tasks compared to previous techniques such as LeCTR. This benefit results from the improved credit assignment, which helps the teacher learning process; the representation of student policies using DNNs, which can handle large state-action spaces; and the high-level advising, which addresses long horizons and delayed rewards. Second, agents can learn high-level teaching policies that are transferable to different types of agents and/or tasks. Finally, agents can have different dynamics/action spaces because the high-level teaching enables knowledge transfer that is agnostic to these details.
2 Background
We consider a cooperative MARL setting in which $n$ agents jointly interact in the environment, then receive feedback via local observations and a shared team reward. This setting can be formalized as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), defined as a tuple $\langle \mathcal{I},\mathcal{S},\bm{\mathcal{A}},\mathcal{T},\mathbf{\Omega},\mathcal{O},\mathcal{R},\gamma \rangle$ [22]; $\mathcal{I}=\{1,\ldots,n\}$ is the set of agents, $\mathcal{S}$ is the set of states, $\bm{\mathcal{A}}=\times_{i\in \mathcal{I}}\mathcal{A}^{i}$ is the set of joint actions, $\mathcal{T}$ is the transition probability function, $\mathbf{\Omega}=\times_{i\in \mathcal{I}}\Omega^{i}$ is the set of joint observations, $\mathcal{O}$ is the observation probability function, $\mathcal{R}$ is the reward function, and $\gamma \in [0,1)$ is the discount factor. At each timestep $t$, each agent $i$ executes an action according to its policy $a_{t}^{i}\sim \pi^{i}(h_{t}^{i};\theta^{i})$ parameterized by $\theta^{i}$, where $h_{t}^{i}=(o_{1}^{i},\ldots,o_{t}^{i})$ is agent $i$’s observation history at timestep $t$. The joint action $\bm{a}_{t}=\{a_{t}^{1},\ldots,a_{t}^{n}\}$ yields a transition from the current state $s_{t}\in \mathcal{S}$ to the next state $s_{t+1}\in \mathcal{S}$ with probability $\mathcal{T}(s_{t+1}|s_{t},\bm{a}_{t})$. Then, a joint observation $\bm{o}_{t+1}=\{o_{t+1}^{1},\ldots,o_{t+1}^{n}\}$ is obtained and the team receives a shared reward $r_{t}=\mathcal{R}(s_{t},\bm{a}_{t})$. The agents’ objective is to maximize the expected cumulative reward $\mathbb{E}[\sum_{t}\gamma^{t}r_{t}]$. The policy parameter will often be omitted, and reactive policies are considered (e.g., $\pi^{i}(h_{t}^{i};\theta^{i})\equiv \pi^{i}(o_{t}^{i})$).
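As a small illustration of the objective above, the following sketch computes the discounted sum $\sum_{t}\gamma^{t}r_{t}$ for a single episode of shared team rewards. The function name and the toy reward list are assumptions made for illustration; they do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one episode of shared team rewards."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Accumulate backwards so that each step is a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Toy episode (hypothetical): the team is rewarded only at the last step.
team_rewards = [0.0, 0.0, 1.0]
print(discounted_return(team_rewards, gamma=0.5))  # 0.25
```

The expectation in the objective would then be estimated by averaging this quantity over many sampled episodes.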
2.1 Learning to Teach in Cooperative MARL
We review key concepts and notation in the learning-to-teach framework.
Task-Level Learning Problem. We consider a cooperative MARL setting with two agents $i$ and $j$ in a shared environment. At each learning iteration, agents interact in the environment, collect experiences, and update their policies, $\pi^{i}$ and $\pi^{j}$, with learning algorithms, $\mathbb{L}^{i}$ and $\mathbb{L}^{j}$. The resulting policies aim to coordinate and optimize final task performance. This problem of learning task-related policies is referred to as the task-level learning problem $\mathcal{P}_{\text{Task}}$.
Advice-Level Learning Problem. Throughout task-level learning, even without assuming that any agents are experts, agents may still develop unique skills from their experiences. As such, it is potentially beneficial for agents to advise one another using their specialized knowledge to improve the final performance and accelerate team-wide learning. The problem of learning teacher policies that decide when and what to advise is referred to as the advice-level learning problem, $\tilde{\mathcal{P}}_{\text{Advise}}$, where $\tilde{(\cdot )}$ refers to a teacher policy property.
Learning task-level policies and advice-level policies are both RL problems, interleaved within the learning-to-teach framework. However, there are important differences between $\mathcal{P}_{\text{Task}}$ and $\tilde{\mathcal{P}}_{\text{Advise}}$. One difference lies in the definition of learning episodes, as rewards about the success of advice are naturally delayed relative to typical task-level rewards. For $\mathcal{P}_{\text{Task}}$, an episode terminates either when agents arrive at a terminal state or when the timestep $t$ exceeds a pre-specified value $T$. In contrast, for $\tilde{\mathcal{P}}_{\text{Advise}}$, an episode ends when task-level policies have converged, forming one “episode” for learning teaching policies. Upon completion of the advice-level episode, task-level policies are re-initialized and training proceeds for another advice-level episode. To avoid confusion, we refer to one task-level problem episode as an episode and to one advice-level problem episode as a session (see Figure 1(a)). Also, $\mathcal{P}_{\text{Task}}$ and $\tilde{\mathcal{P}}_{\text{Advise}}$ have different learning objectives. Task-level learning aims to coordinate and maximize cumulative reward per episode, whereas advice-level learning aims to maximize cumulative teacher reward per session, corresponding to accelerating team-wide learning progress (i.e., a maximum area under the learning curve in one session). Lastly, task-level policies are inherently off-policy, while teacher policies are not necessarily so. This is because task-level policies are updated with experiences affected by teacher policies, instead of experiences generated by following agents’ task-level policies alone.
2.2 Hierarchical Reinforcement Learning (HRL)
HRL is a structured framework with multi-level reasoning and extended temporal abstraction [24, 29, 8, 14, 2, 33, 25, 7, 34]. HRL efficiently decomposes a complex problem into simpler sub-problems, which offers a benefit over non-HRL in solving difficult tasks with long horizons and delayed reward assignments. The HRL framework closest to the one we use in this paper is that of [21], which has a two-layer hierarchical structure: the higher-level manager policy ($\pi_{\mathcal{M}}$) and the lower-level worker policy ($\pi_{\mathcal{W}}$). The manager policy obtains an observation $o_{t}$ and plans a high-level subgoal $g_{t}\sim \pi_{\mathcal{M}}(o_{t})$ for the worker policy. The worker policy attempts to reach this subgoal from the current state by executing a primitive action $a_{t}\sim \pi_{\mathcal{W}}(o_{t},g_{t})$ in the environment. Following this framework, an updated subgoal is generated by the manager every $H$ timesteps, and a sequence of primitive actions is executed by the worker. The manager learns to accomplish a task by optimizing the cumulative environment reward and stores an experience $\langle o_{t},g_{t},\sum_{k=t}^{t+H-1}r_{k},o_{t+H}\rangle$ every $H$ timesteps. By contrast, the worker learns to reach the subgoal by maximizing the cumulative intrinsic reward $r_{t}^{\text{intrinsic}}$ and stores an experience $\langle o_{t},a_{t},r_{t}^{\text{intrinsic}},o_{t+1}\rangle$ at each timestep. Without loss of generality, we also denote the next observation with the prime symbol ($o^{\prime}$).
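The manager/worker interaction described above can be sketched as a rollout loop. This is a minimal illustration on a one-dimensional toy problem, not the paper's implementation: the environment, the hand-written manager and worker, and the negative-distance intrinsic reward are all assumptions introduced here.

```python
def run_hierarchical_episode(env_step, manager, worker, obs, horizon, max_steps):
    """Roll out one episode; the manager re-plans a subgoal every `horizon` steps."""
    manager_buffer, worker_buffer = [], []
    t = 0
    while t < max_steps:
        start_obs, subgoal, segment_reward = obs, manager(obs), 0.0
        for _ in range(min(horizon, max_steps - t)):
            action = worker(obs, subgoal)
            next_obs, reward = env_step(obs, action)
            # Assumed intrinsic reward: negative distance to the subgoal.
            intrinsic = -abs(subgoal - next_obs)
            worker_buffer.append((obs, action, intrinsic, next_obs))
            segment_reward += reward
            obs = next_obs
            t += 1
        # Manager experience: <o_t, g_t, summed reward over the segment, o_{t+H}>.
        manager_buffer.append((start_obs, subgoal, segment_reward, obs))
    return manager_buffer, worker_buffer


# Hypothetical 1-D toy problem: reach position 5.0 on a line.
def toy_env_step(obs, action):
    next_obs = obs + action
    return next_obs, -abs(5.0 - next_obs)


def toy_manager(obs):
    return obs + 2.0  # always ask for a point two units ahead


def toy_worker(obs, subgoal):
    return max(-1.0, min(1.0, subgoal - obs))  # bounded step toward the subgoal


manager_exps, worker_exps = run_hierarchical_episode(
    toy_env_step, toy_manager, toy_worker, obs=0.0, horizon=3, max_steps=6)
print(manager_exps)      # [(0.0, 2.0, -10.0, 2.0), (2.0, 4.0, -4.0, 4.0)]
print(len(worker_exps))  # 6
```

The two buffers fill at different rates: one manager experience per $H$ steps and one worker experience per step, which matches the two experience tuples given in the text.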
3 Overview of Hierarchical Multiagent Teaching (HMAT)
3.1 Deep Hierarchical Task-Level Policy
HMAT addresses the limited scalability of LeCTR by using deep function approximators and by learning high-level teacher policies that decide what high-level actions (e.g., subgoals) to advise fellow agents and when that advice should be given. Specifically, to extend task-level policies with DNNs and hierarchical representations, we replace $\pi^{i}$ and $\pi^{j}$ with deep hierarchical policies consisting of manager policies, $\pi_{\mathcal{M}}^{i}$ and $\pi_{\mathcal{M}}^{j}$, and worker policies, $\pi_{\mathcal{W}}^{i}$ and $\pi_{\mathcal{W}}^{j}$ (see Figure 1(b)). The manager and worker policies in HMAT are trained with different objectives. Managers learn to accomplish a task together (i.e., solving $\mathcal{P}_{\text{Task}}$) by optimizing cumulative reward, while workers are trained to reach subgoals suggested by their managers (see Appendix C for details).
In this paper, we focus on transferring knowledge at the manager level instead of the worker level, since manager policies represent abstract knowledge, which is more relevant to fellow agents. Such a consideration also allows managers to transfer knowledge even if workers have dynamics, action spaces, or observation spaces unique to each agent. Therefore, hereafter when we discuss task-level policies, it is implied that we are discussing only the manager policy. The manager subscript $\mathcal{M}$ will often be omitted when discussing these task-level policies (i.e., $\pi^{i}\equiv \pi_{\mathcal{M}}^{i}$) to simplify notation.
3.2 Advice-Level Learning in Hierarchical Settings
We consider the problem of sharing knowledge between managers via teacher policies, $\tilde{\pi}^{i}$ and $\tilde{\pi}^{j}$, which learn when and what subgoal actions to advise. Consider Figure 1(b), where agents are learning to coordinate with hierarchical task-level policies (solving $\mathcal{P}_{\text{Task}}$) while advising each other via teacher policies (solving $\tilde{\mathcal{P}}_{\text{Advise}}$). There are two roles in Figure 1(b): that of a student agent $j$ (i.e., an agent whose manager policy receives advice) and that of a teacher agent $i$ (i.e., an agent whose teacher policy gives advice). Note that agents $i$ and $j$ can simultaneously teach each other, but, for clarity, Figure 1(b) shows only a one-way interaction. Here, student $j$ has decided that it is appropriate to strive for subgoal $g^{j}$ by querying its manager policy. Before $j$ passes the subgoal $g^{j}$ to its worker, $i$’s teacher policy checks $j$’s intended subgoal and decides whether or not to advise. Having decided to advise, $i$ transforms its task-level knowledge into desirable subgoal advice via its teacher policy and suggests it to $j$. After student $j$ accepts the advice from the teacher, the updated subgoal $g^{j}$ is passed to $j$’s worker policy, which then generates a primitive action $a^{j}$.
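The one-way advising interaction just described can be sketched as a single function. Everything here is illustrative: the function names, the dictionary used as a teacher observation, and the rule the toy teacher uses to decide when to advise are assumptions, since in HMAT the teacher policy is learned rather than hand-written.

```python
def advised_subgoal(student_manager, teacher_policy, student_obs, teacher_obs):
    """Return (subgoal handed to the student's worker, advice_given flag)."""
    # Student j first queries its own manager for an intended subgoal.
    intended = student_manager(student_obs)
    # Teacher i inspects the intended subgoal and decides whether to advise.
    advise, suggestion = teacher_policy(teacher_obs, intended)
    if advise:
        return suggestion, True   # the student accepts the advised subgoal
    return intended, False        # no advice: keep the intended subgoal


# Toy stand-ins (hypothetical): subgoals are positions on a line.
def toy_student_manager(obs):
    return obs + 1.0


def toy_teacher_policy(teacher_obs, intended, tolerance=0.5):
    preferred = teacher_obs["teacher_subgoal"]
    return abs(preferred - intended) > tolerance, preferred


print(advised_subgoal(toy_student_manager, toy_teacher_policy,
                      0.0, {"teacher_subgoal": 3.0}))  # (3.0, True)
```

The returned subgoal is what the student's worker policy would then condition on when producing a primitive action.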
4 Details of HMAT
4.1 Algorithm of Hierarchical Multiagent Teaching
HMAT iterates over the following three phases to learn how to coordinate with deep hierarchical task-level policies (solving $\mathcal{P}_{\text{Task}}$) and how to provide advice using the teacher policies (solving $\tilde{\mathcal{P}}_{\text{Advise}}$). These phases are designed to address the teacher credit assignment issue with deep task-level policies. As discussed, identifying which portions of the advice led to successful student learning is difficult (see Section 1). We address that issue by adapting ideas developed for learning an exploration policy for a single agent [35], which include an extended view of actions (see below) and the use of a temporary policy for measuring a reward for the exploration policy. We adapt and extend these ideas from learning-to-explore in a single-agent setting to learning-to-teach in a multiagent setting. Pseudocode for HMAT is presented in Appendix A.
Phase I (Advising Phase). Agents advise one another using their teacher policies according to the advising protocol during one episode. This process generates a batch of task-level experiences influenced by the teacher policies’ behavior. Specifically, we extend the concept of teacher policies by providing advice in the form of a sequence of multiple subgoals, $\tilde{\bm{g}}_{0:T}=\{\tilde{g}_{0:T}^{i},\tilde{g}_{0:T}^{j}\}$ (where $\tilde{g}_{0:T}^{i}=(\tilde{g}_{0}^{i},\tilde{g}_{H}^{i},\ldots,\tilde{g}_{T}^{i})$ denotes multiple subgoals in one episode), instead of providing just one piece of advice $\tilde{\bm{g}}_{t}=\{\tilde{g}_{t}^{i},\tilde{g}_{t}^{j}\}$ before updating task-level policies. One teacher action in this extended view corresponds to providing multiple pieces of advice during one episode, which contrasts with previous approaches to teaching in MARL, in which task-level policies were updated based on a single piece of advice [1, 4, 5, 23, 31]. This extension is important because, by following advice $\tilde{\bm{g}}_{0:T}$ in an episode, a batch of task-level experiences for agents $i$ and $j$, $\bm{E}_{0:T}^{\text{advice}}=\{\langle o_{0:T}^{i},\tilde{g}_{0:T}^{i},r_{0:T},o_{0:T}^{\prime i}\rangle ,\langle o_{0:T}^{j},\tilde{g}_{0:T}^{j},r_{0:T},o_{0:T}^{\prime j}\rangle \}$, is generated to enable stable mini-batch updates of the DNN policies.
Phase II (Advice Evaluation Phase). Learning teacher policies requires reward feedback on the advice given in Phase I. Phase II estimates the impact that the advice had on improving team-wide learning progress, yielding the teacher policy’s reward. A temporary task-level policy is used to estimate the teacher reward for $\tilde{\bm{g}}_{0:T}$, so agents copy their current task-level policies to temporary policies ($\bm{\pi}_{\text{temp}}\leftarrow \bm{\pi}$). To determine the teacher reward for $\tilde{\bm{g}}_{0:T}$, the temporary policies $\bm{\pi}_{\text{temp}}$ are updated for a small number of iterations using only $\bm{E}_{0:T}^{\text{advice}}$ (i.e., $\bm{\pi}_{\text{temp}}^{\prime}\leftarrow \mathbb{L}(\bm{\pi}_{\text{temp}},\bm{E}_{0:T}^{\text{advice}})$). The updated temporary policies generate a batch of self-practice experiences $\bm{E}_{0:\overline{T}}^{\text{self-prac.}}$ by rolling out a total of $\overline{T}$ timesteps without involving teacher policies. These self-practice experiences, which are based on $\bm{\pi}_{\text{temp}}^{\prime}$, reflect how agents (on their own) would perform after the advising phase and can be used to estimate the impact of $\tilde{\bm{g}}_{0:T}$ on team-wide learning. The function $\tilde{\mathcal{R}}$ (Section 4.2) uses the self-practice experiences to compute the teacher reward for $\tilde{\bm{g}}_{0:T}$ (i.e., $\tilde{r}=\tilde{\mathcal{R}}(\bm{E}_{0:\overline{T}}^{\text{self-prac.}})$). A key point is that the temporary policies $\bm{\pi}_{\text{temp}}^{\prime}$ used to compute $\tilde{r}$ are updated based only on $\bm{E}_{0:T}^{\text{advice}}$ (i.e., experiences from past iterations are not used), which resolves the teacher credit assignment issue.
Phase III (Policy Update Phase). Task-level policies are updated to solve $\mathcal{P}_{\text{Task}}$ by using the learning algorithms $\mathbb{L}^{i}$ and $\mathbb{L}^{j}$, and advice-level policies are updated to solve $\tilde{\mathcal{P}}_{\text{Advise}}$ by using the learning algorithms $\tilde{\mathbb{L}}^{i}$ and $\tilde{\mathbb{L}}^{j}$. In particular, task-level policies, $\pi^{i}$ and $\pi^{j}$, are updated for the next iteration by randomly sampling experiences from task-level experience memories, $\mathcal{D}^{i}$ and $\mathcal{D}^{j}$. As in [35], both $\bm{E}_{0:T}^{\text{advice}}$ and $\bm{E}_{0:\overline{T}}^{\text{self-prac.}}$ are added to the task-level memories. Similarly, after the teacher experiences collected from the advising phase (Phase I) and the advice evaluation phase (Phase II) are added to the teacher experience memories, $\tilde{\mathcal{D}}^{i}$ and $\tilde{\mathcal{D}}^{j}$, teacher policies are updated by randomly selecting samples from their replay memories, but at a lower update frequency than that of task-level policies.
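The three phases can be summarized in a short control-flow sketch. It is a reading of the description above under strong simplifications, not the pseudocode from the paper's appendix: the policy is a single number, the `fns` dictionary bundles hypothetical callables, the teacher experience omits the next teacher observation, and the teacher-policy update itself is left out.

```python
import copy
import random


def hmat_iteration(policy, teacher, task_memory, teacher_memory, fns,
                   temp_updates=2, batch_size=32):
    """One HMAT iteration over Phases I-III (callables in `fns` are hypothetical)."""
    # Phase I: an advising episode yields task-level experiences shaped by advice.
    advice_batch, teacher_obs, advice = fns["advise_episode"](policy, teacher)

    # Phase II: update a temporary copy using ONLY the advice batch, then let
    # it practise without the teacher and score the resulting experiences.
    temp_policy = copy.deepcopy(policy)
    for _ in range(temp_updates):
        temp_policy = fns["update"](temp_policy, advice_batch)
    practice_batch = fns["self_practice"](temp_policy)
    teacher_reward = fns["teacher_reward"](practice_batch)

    # Phase III: store both batches, then update the real policy from replay.
    task_memory.extend(advice_batch)
    task_memory.extend(practice_batch)
    teacher_memory.append((teacher_obs, advice, teacher_reward))
    minibatch = random.sample(task_memory, min(batch_size, len(task_memory)))
    return fns["update"](policy, minibatch), teacher_reward


# Toy stand-ins: experiences are (obs, subgoal, reward, next_obs) tuples and
# the "policy" is one number that drifts toward the mean subgoal in a batch.
TARGET = 4.0


def toy_advise_episode(policy, teacher):
    return [(0.0, teacher, -abs(TARGET - teacher), teacher)], [0.0], [teacher]


def toy_update(policy, batch):
    mean_subgoal = sum(exp[1] for exp in batch) / len(batch)
    return policy + 0.5 * (mean_subgoal - policy)


def toy_self_practice(policy):
    return [(0.0, policy, -abs(TARGET - policy), policy)]


toy_fns = {
    "advise_episode": toy_advise_episode,
    "update": toy_update,
    "self_practice": toy_self_practice,
    "teacher_reward": lambda batch: sum(exp[2] for exp in batch),
}

task_mem, teacher_mem = [], []
new_policy, r_tilde = hmat_iteration(0.0, 4.0, task_mem, teacher_mem, toy_fns)
print(new_policy, r_tilde, len(task_mem), len(teacher_mem))  # 1.75 -1.0 2 1
```

The point of the sketch is the data flow in Phase II: the temporary copy never sees older replay data, so the reward it produces can be attributed to the current batch of advice alone.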
4.2 Details of Teacher Policy
We explain important components of the teacher policy (focusing on teacher $i$ for clarity). See Appendix B for additional details.
Teacher Observation and Action. Teacher-level observations $\tilde{\bm{o}}_{t}=\{\tilde{o}_{t}^{i},\tilde{o}_{t}^{j}\}$ compactly provide information about the nature of the heterogeneous knowledge between the two agents. Given $\tilde{o}_{t}^{i}$, teacher $i$ decides when and what to advise, with one action for deciding whether or not to provide advice and another action for selecting the subgoal to give as advice. If no advice is provided, student $j$ executes its originally intended subgoal.
Teacher Reward Function. Teacher policies aim to maximize cumulative teacher reward in a session, which should result in faster team-wide learning progress. Recall from Phase II that the batch of self-practice experiences reflects how agents perform by themselves after one advising phase. The open question is how to identify an appropriate teacher reward function $\tilde{\mathcal{R}}$ that can map self-practice experiences into better learning performance. Intuitively, maximizing the reward returned by a teacher reward function $\tilde{r}=\tilde{\mathcal{R}}(\bm{E}_{0:\overline{T}}^{\text{self-prac.}})$ means that teachers should advise so that the learning performance is maximized after one advising phase. In this work, we consider a new reward function, called current rollout (CR), which returns the sum of rewards in the self-practice experiences $\bm{E}_{0:\overline{T}}^{\text{self-prac.}}$. We also evaluate different choices of teacher reward functions, including the ones in [23] and [35].
Teacher Experiences. One teacher experience corresponds to $\tilde{E}_{0:T}^{i}=\langle \tilde{o}_{0:T}^{i},\tilde{g}_{0:T}^{i},\tilde{r},\tilde{o}_{0:T}^{\prime i}\rangle$, where $\tilde{o}_{0:T}^{i}=(\tilde{o}_{0}^{i},\tilde{o}_{H}^{i},\tilde{o}_{2H}^{i},\ldots,\tilde{o}_{T}^{i})$ is the sequence of teacher observations; $\tilde{g}_{0:T}^{i}=(\tilde{g}_{0}^{i},\tilde{g}_{H}^{i},\tilde{g}_{2H}^{i},\ldots,\tilde{g}_{T}^{i})$ is the sequence of teacher actions; $\tilde{r}$ is the teacher reward estimated with $\tilde{\mathcal{R}}$; and $\tilde{o}_{0:T}^{\prime i}=(\tilde{o}_{0}^{\prime i},\tilde{o}_{H}^{\prime i},\tilde{o}_{2H}^{\prime i},\ldots,\tilde{o}_{T}^{\prime i})$ is the sequence of next teacher observations, obtained by updating $\tilde{o}_{0:T}^{i}$ with the updated temporary policy $\pi_{\text{temp}}^{\prime j}$ (i.e., representing the change in student $j$’s knowledge due to advice $\tilde{g}_{0:T}^{i}$).
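The current rollout (CR) reward is simple enough to state directly in code. The sketch below assumes each self-practice experience is stored as an `(observation, subgoal, reward, next_observation)` tuple; that storage layout and the function name are assumptions made for illustration.

```python
def current_rollout_reward(self_practice_experiences):
    """CR teacher reward: sum of task rewards over the self-practice batch."""
    return sum(reward for (_obs, _subgoal, reward, _next_obs)
               in self_practice_experiences)


# Hypothetical self-practice batch with three experiences.
batch = [
    ("o0", "g0", -2.0, "o1"),
    ("o1", "g1", -1.0, "o2"),
    ("o2", "g2", 0.5, "o3"),
]
print(current_rollout_reward(batch))  # -2.5
```

A higher value means the temporary policies performed better on their own after the advising phase, which is the signal the teacher is trained to increase.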
4.3 Training Protocol
Task-Level Training. To accommodate the inherently off-policy task-level policies (Section 2.1), we use the off-policy TD3 algorithm [10] to train the worker and manager policies. TD3 is an actor-critic algorithm that introduces two critics, $Q_{1}$ and $Q_{2}$, to reduce the overestimation of Q-value estimates in DDPG [17], and it yields more robust learning performance. TD3 was originally proposed as a single-agent deep RL algorithm accommodating continuous states/actions. Here, we extend TD3 to multiagent settings, with the resulting algorithm termed MATD3, and address non-stationarity in MARL by applying centralized critics and decentralized actors [9, 18]. Another algorithm, termed HMATD3, further extends MATD3 with HRL. In HMATD3, agent $i$’s task policy critics, $Q_{1}^{i}$ and $Q_{2}^{i}$, minimize the following critic loss:
$$\mathcal{L}=\sum_{\alpha =1}^{2}\underset{\langle \mathbf{o},\mathbf{g},r,\mathbf{o}^{\prime}\rangle \sim \mathcal{D}^{i}}{\mathbb{E}}\left[y-Q_{\alpha}^{i}(\mathbf{o},\mathbf{g})\right]^{2},\quad \text{s.t.}\quad y=r+\gamma \min_{\beta =1,2}Q_{\beta ,\text{target}}^{i}(\mathbf{o}^{\prime},\pi_{\text{target}}(\mathbf{o}^{\prime})+\epsilon),$$  (1)
where $\mathbf{o}=\{o^{i},o^{j}\}$; $\mathbf{g}=\{g^{i},g^{j}\}$; $\mathbf{o}^{\prime}=\{o^{\prime i},o^{\prime j}\}$; $\pi_{\text{target}}=\{\pi_{\text{target}}^{i},\pi_{\text{target}}^{j}\}$; the subscript “target” denotes the target network; and $\epsilon \sim \mathcal{N}(0,\sigma )$. Agent $i$’s actor policy $\pi^{i}$ with parameter $\theta^{i}$ is updated by:
$$\nabla_{\theta^{i}}J(\theta^{i})=\underset{\mathbf{o}\sim \mathcal{D}^{i}}{\mathbb{E}}\left[\nabla_{\theta^{i}}\pi^{i}(g^{i}|o^{i})\,\nabla_{g^{i}}Q_{1}^{i}(\mathbf{o},\mathbf{g})\big|_{\mathbf{g}=\pi (\mathbf{o})}\right].$$  (2)
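The target $y$ and the two-critic loss in Equation (1) can be sketched without any deep-learning library by treating critics and the target policy as plain functions. The function names, the list-based observations and subgoals, and the toy critics are assumptions for illustration; the noise is drawn from $\mathcal{N}(0,\sigma)$ as the equation states, without the clipping that some TD3 implementations add.

```python
import random


def td3_target(reward, gamma, q1_target, q2_target, target_policy,
               next_obs, sigma, rng):
    """y = r + gamma * min over both target critics at (o', pi_target(o') + eps)."""
    noisy_goal = [g + rng.gauss(0.0, sigma) for g in target_policy(next_obs)]
    return reward + gamma * min(q1_target(next_obs, noisy_goal),
                                q2_target(next_obs, noisy_goal))


def critic_loss(batch, critics, target_fn):
    """Sum over critics of the mean squared error between y and Q(o, g)."""
    loss = 0.0
    for q in critics:
        errors = [(target_fn(r, next_o) - q(o, g)) ** 2
                  for (o, g, r, next_o) in batch]
        loss += sum(errors) / len(errors)
    return loss


# Toy joint critics and target policy (hypothetical, for a two-agent team).
def q1(obs, goal):
    return sum(obs) + sum(goal)


def q2(obs, goal):
    return sum(obs) + sum(goal) + 1.0


def toy_target_policy(obs):
    return [0.5 for _ in obs]


rng = random.Random(0)
y = td3_target(1.0, 0.5, q1, q2, toy_target_policy, [1.0, 2.0], 0.0, rng)
print(y)  # 3.0 (sigma is 0 here, so no noise is added)

batch = [([0.0, 0.0], [1.0, 1.0], 1.0, [1.0, 2.0])]
target_fn = lambda r, next_o: td3_target(r, 0.5, q1, q2, toy_target_policy,
                                         next_o, 0.0, rng)
print(critic_loss(batch, [q1, q2], target_fn))  # 1.0
```

Taking the minimum over the two target critics is what counters the overestimation mentioned in the text; both critics regress toward the same target $y$.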
Advice-Level Training. TD3 is also used for updating teacher policies. We modify Equations (1) and (2) to account for the teacher’s extended view. Considering agent $i$ for clarity, agent $i$’s teacher policy critics, $\tilde{Q}_{1}^{i}$ and $\tilde{Q}_{2}^{i}$, minimize the following critic loss:
$$\tilde{\mathcal{L}}=\sum_{\alpha =1}^{2}\underset{\tilde{\bm{E}}_{0:T}\sim \tilde{\mathcal{D}}^{i}}{\mathbb{E}}\left[\underset{\tilde{\bm{E}}\sim \tilde{\bm{E}}_{0:T}}{\mathbb{E}}\left(y-\tilde{Q}_{\alpha}^{i}(\tilde{\mathbf{o}},\tilde{\mathbf{g}})\right)\right]^{2},$$  (3)
$$\text{s.t.}\quad y=\tilde{r}+\gamma \min_{\beta =1,2}\tilde{Q}_{\beta ,\text{target}}^{i}(\tilde{\mathbf{o}}^{\prime},\tilde{\pi}_{\text{target}}(\tilde{\mathbf{o}}^{\prime})+\epsilon),\quad \tilde{\bm{E}}_{0:T}=\langle \tilde{\mathbf{o}}_{0:T},\tilde{\mathbf{g}}_{0:T},\tilde{r},\tilde{\mathbf{o}}_{0:T}^{\prime}\rangle ,\quad \tilde{\bm{E}}=\langle \tilde{\mathbf{o}},\tilde{\mathbf{g}},\tilde{r},\tilde{\mathbf{o}}^{\prime}\rangle ,$$
where $\tilde{\mathbf{o}}=\{\tilde{o}^{i},\tilde{o}^{j}\}$; $\tilde{\mathbf{g}}=\{\tilde{g}^{i},\tilde{g}^{j}\}$; $\tilde{\mathbf{o}}^{\prime}=\{\tilde{o}^{\prime i},\tilde{o}^{\prime j}\}$; and $\tilde{\pi}_{\text{target}}=\{\tilde{\pi}_{\text{target}}^{i},\tilde{\pi}_{\text{target}}^{j}\}$. Agent $i$’s actor policy $\tilde{\pi}^{i}$ with parameter $\tilde{\theta}^{i}$ is updated by:
$$\nabla_{\tilde{\theta}^{i}}J(\tilde{\theta}^{i})=\underset{\tilde{\mathbf{o}}_{0:T}\sim \tilde{\mathcal{D}}^{i}}{\mathbb{E}}\left[\underset{\tilde{\mathbf{o}}\sim \tilde{\mathbf{o}}_{0:T}}{\mathbb{E}}(z)\right],$$  (4)
where $z=\nabla_{\tilde{\theta}^{i}}\tilde{\pi}^{i}(\tilde{g}^{i}|\tilde{o}^{i})\,\nabla_{\tilde{g}^{i}}\tilde{Q}_{1}^{i}(\tilde{\mathbf{o}},\tilde{\mathbf{g}})\big|_{\tilde{\mathbf{g}}=\tilde{\pi}(\tilde{\mathbf{o}})}$.
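What distinguishes the teacher critic loss from the task-level one is the nested expectation: errors are first averaged within one advice sequence and only then squared. The sketch below follows that reading of Equation (3) literally; the scalar observations, the toy critic, and the toy target function are assumptions for illustration.

```python
def teacher_critic_loss(sequences, critics, target_fn):
    """For each critic: average the TD error within a sequence, square it,
    then average over the sampled sequences; finally sum over critics."""
    loss = 0.0
    for q in critics:
        squared_means = []
        for seq in sequences:
            # Every element of a sequence shares the same teacher reward.
            errors = [target_fn(r_tilde, next_o) - q(o, g)
                      for (o, g, r_tilde, next_o) in seq]
            squared_means.append((sum(errors) / len(errors)) ** 2)
        loss += sum(squared_means) / len(squared_means)
    return loss


# Toy scalar critic and target (hypothetical).
def toy_q(obs, goal):
    return obs + goal


def toy_target(r_tilde, next_obs, gamma=0.5):
    return r_tilde + gamma * next_obs


one_sequence = [(0.0, 1.0, 2.0, 2.0), (1.0, 1.0, 2.0, 0.0)]
print(teacher_critic_loss([one_sequence], [toy_q, toy_q], toy_target))  # 2.0
```

In this toy case the two per-element errors are 2.0 and 0.0, so each critic contributes $(1.0)^{2}=1.0$; squaring before averaging would instead give 2.0 per critic, which shows that the order of the two operations matters.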
5 Related Work
Imitation learning studies how to learn a policy from expert demonstrations [6, 28, 27]. Recent work applied imitation learning to multiagent coordination [16] and explored effective combinations of imitation learning and hierarchical RL [15]. Curriculum learning [3, 12, 32], which progressively increases task difficulty, is also relevant. While most work on these topics focuses on learning knowledge for solving single-agent problems, we study peer-to-peer knowledge transfer in cooperative MARL. Related work by Xu et al. [35] learns an exploration policy, which relates to our approach to learning teaching policies. Their approach led to unstable learning in our setting, motivating our policy update rule (Section 4.3). We also consider various teacher reward functions, including the one in [35], and find that the new CR reward function performs better empirically in our domains.
6 Evaluation
We demonstrate HMAT’s performance in increasingly challenging domains that involve continuous states/actions, long horizons, and delayed rewards.
6.1 Evaluation Domains and Tasks
Our domains are based on OpenAI’s multiagent particle environment, which supports continuous observation/action spaces. We modify the environment and propose new domains, called cooperative one/two box push, that require agents to coordinate (see Appendix C for additional details).
Cooperative One Box Push (COBP). The domain consists of one round box and two agents (Figure 2(a)). The objective is to move the box to the target on the left side as soon as possible. The box can be moved if and only if two agents act on it together. This property requires that the agents coordinate. The domain has a delayed reward because there is no change in reward until the box is moved by the two agents. An episode in COBP ends when $t$ exceeds $50$ timesteps.
Cooperative Two Box Push (CTBP). This domain is similar to the one box domain but with increased complexity. There are two round boxes in the domain (Figure 2(b)). The objective is to move the left box (box$1$) to the left target (target$1$) and the right box (box$2$) to the right target (target$2$). In addition, the boxes have different masses: box$2$ is three times heavier than box$1$. An episode in CTBP ends when $t$ exceeds $100$ timesteps.
Heterogeneous Knowledge. For each domain, we provide each agent with a different set of priors to ensure heterogeneous knowledge between them and to motivate interesting teaching scenarios. For COBP (Figure 2(a)), agents $i$ and $j$ are first trained to move the box to the target. Then, agents $i$ and $k$ are teamed up, where agent $k$ has no knowledge about the domain. Agent $i$, which understands how to move the box, should teach agent $k$ by giving good advice to improve $k$’s learning progress. For CTBP (Figure 2(b)), agents $i$ and $j$ have received prior training on how to move box$1$ to target$1$, and agents $k$ and $l$ understand how to move box$2$ to target$2$. These two teams have different skills, as the tasks involve moving boxes with different weights (light vs. heavy) and in different directions (left vs. right). Then agents $i$ and $k$ are teamed up, and in this scenario, agent $i$ should transfer its knowledge about moving box$1$ to agent $k$. Meanwhile, agent $k$ should teach agent $i$ how to move box$2$, so that there is a two-way transfer of knowledge in which each agent is the primary teacher at one point and the primary student at another.
| Algorithm | Hierarchical? | Teaching? | One Box Push $\overline{V}$ | One Box Push AUC | Two Box Push $\overline{V}$ | Two Box Push AUC |
|---|---|---|---|---|---|---|
| MATD3 | ✗ | ✗ | $-12.64\pm 3.42$ | $148\pm 105$ | $-44.21\pm 5.01$ | $1833\pm 400$ |
| LeCTR–Tile | ✗ | ✓ | $-18.00\pm 0.01$ | $6\pm 0$ | $-60.50\pm 0.05$ | $246\pm 3$ |
| LeCTR–D | ✗ | ✓ | $-14.85\pm 2.67$ | $81\pm 80$ | $-46.07\pm 3.00$ | $1739\pm 223$ |
| LeCTR–OD | ✗ | ✓ | $-17.92\pm 0.10$ | $17\pm 3$ | $-60.43\pm 0.08$ | $272\pm 26$ |
| AI | ✗ | ✓ | $\mathbf{-11.33\pm 2.46}$ | $162\pm 86$ | $-41.09\pm 3.50$ | $2296\pm 393$ |
| AICI | ✗ | ✓ | $\mathbf{-10.60\pm 0.85}$ | $200\pm 74$ | $-38.23\pm 0.45$ | $2742\pm 317$ |
| MAT | ✗ | ✓ | $\mathbf{-10.04\pm 0.38}$ | $274\pm 38$ | $-39.49\pm 2.93$ | $2608\pm 256$ |
| HMATD3 | ✓ | ✗ | $\mathbf{-10.24\pm 0.20}$ | $427\pm 18$ | $-31.55\pm 3.51$ | $4288\pm 335$ |
| LeCTR–HD | ✓ | ✓ | $-12.10\pm 0.94$ | $265\pm 59$ | $-38.87\pm 2.94$ | $2820\pm 404$ |
| LeCTR–OHD | ✓ | ✓ | $-16.77\pm 0.66$ | $70\pm 22$ | $-60.23\pm 0.33$ | $306\pm 28$ |
| HAI | ✓ | ✓ | $\mathbf{-10.23\pm 0.19}$ | $427\pm 9$ | $-31.37\pm 3.71$ | $4400\pm 444$ |
| HAICI | ✓ | ✓ | $\mathbf{-10.25\pm 0.26}$ | $433\pm 5$ | $-29.32\pm 1.19$ | $4691\pm 261$ |
| HMAT | ✓ | ✓ | $\mathbf{-10.10\pm 0.19}$ | $\mathbf{458\pm 8}$ | $\mathbf{-27.49\pm 0.96}$ | $\mathbf{5032\pm 186}$ |
6.2 Baselines
We compare against several baselines to provide context for the performance of HMAT (see Appendix D for additional details).
No-Teaching Baselines: MATD3 and HMATD3 are the baselines for primitive and hierarchical MARL without teaching, respectively.
LeCTR Baselines: These include the LeCTR framework [23] with task-level policies that use tile-coding (LeCTR–Tile). We also consider modifications of that framework in which the task-level policies are learned with deep RL methods: MATD3 (LeCTR–D) and HMATD3 (LeCTR–HD). Lastly, we compare two LeCTR baselines using online MATD3 (LeCTR–OD) and online HMATD3 (LeCTR–OHD), where an online update means the policy is updated with the most recent experience.
Heuristic Teaching Baselines: Two heuristic-based primitive teaching baselines, Ask Important (AI) and Ask Important–Correct Important (AICI) [1], are compared. In AI, each student asks for advice based on the importance of a state, measured with the student’s $Q$-values. When asked, the teacher agent always advises with its best primitive action at the given student state. Students in AICI also ask for advice, but each teacher can decide whether or not to advise with its best action. The teacher decides based on the state importance, measured with the teacher’s $Q$-values, and on the difference between the student’s intended action and the teacher’s intended action at the student state. AICI is one of the best-performing heuristic algorithms in [1]. Hierarchical AI (HAI) and hierarchical AICI (HAICI) are similar to AI and AICI, but teach in the hierarchical setting (i.e., managers teach each other).
HMAT Variants: A non-hierarchical variant of HMAT, called MAT, is compared under different choices of teacher reward functions. For clarity, we only show results with the CR reward function. Performance comparisons between different reward functions are shown in Appendix B.
6.3 Results on One Box and Two Box Push
Table 1 compares HMAT and its baselines. The results show both final task-level performance ($\overline{V}$) and area under the task-level learning curve (AUC); higher values are better for both metrics. The results demonstrate improved task-level learning performance with HMAT compared to HMATD3 and with MAT compared to MATD3, as indicated by the higher final performance ($\overline{V}$) and faster learning (larger AUC) in Figures 2(a) and 2(b). HMAT also outperforms MAT. These results demonstrate the benefit of high-level advising, which helps address the long time horizons and delayed rewards in these two domains.
HMAT also achieves better performance than the LeCTR baselines. LeCTR–Tile shows the smallest $\overline{V}$ and AUC because the tile-coding representation of the policies is too limited for these complex domains. With deep task-level policies, the teacher policies in LeCTR–D and LeCTR–HD have poor estimates of the teacher credit assignment, which results in unstable learning of the teaching policies and worse performance than the no-teaching baselines (i.e., LeCTR–D vs MATD3, LeCTR–HD vs HMATD3). In contrast, both LeCTR–OD and LeCTR–OHD have good estimates of the teacher credit assignment because the task-level policies are updated online. However, these two approaches suffer from instability caused by the absence of mini-batch updates for the DNN policies. Finally, HMAT attains the best performance in terms of $\overline{V}$ and AUC compared to the heuristics-based baselines.
These combined results demonstrate the key advantage of HMAT in that it can accelerate the learning progress for complex tasks with continuous states/actions, long horizons, and delayed rewards.
6.4 Results on Transferability and Heterogeneous Action Spaces
HMAT can learn high-level teaching strategies that transfer to different types of agents and/or tasks. We evaluate transferability from the following two perspectives, as well as teaching with heterogeneous action spaces. The numerical results below are the mean and standard deviation computed over $10$ sessions.
Transfer across Different Student Types We first create a small population of students, each having different knowledge. Specifically, we create $7$ students that can push one box to distinct areas in the one-box push domain: top-left, top, top-right, right, bottom-right, bottom, and bottom-left. This population is divided into train $\{$top-left, top, bottom-right, bottom, and bottom-left$\}$, validation $\{$top-right$\}$, and test $\{$right$\}$ groups. After the teacher policy has converged, we fix the policy and transfer it to a different setting in which the teacher advises a student in the test group. Although the teacher has never interacted with the student in the test group before, it achieves an AUC of $400\pm 10$, compared to the no-teaching baseline (HMATD3) AUC of $369\pm 27$.
Transfer across Different Tasks We first train the teacher in the one-box push domain, where it learns to transfer knowledge to agent $k$ about how to move the box to the left. We then fix the converged teacher policy and evaluate it on a different task: moving the box to the right. While task-level learning without teaching achieves an AUC of $363\pm 43$, task-level learning with teaching achieves an AUC of $414\pm 11$. Thus, learning is faster even when using pre-trained teacher policies from different tasks.
Teaching with Heterogeneous Action Spaces We consider heterogeneous action space variants in the one-box push domain, where agent $k$ has a remapped primitive action space (e.g., $180^{\circ}$ rotation) compared to its teammate $i$. The original heuristic-based algorithms advise at the level of primitive actions and thus assume a homogeneous action space. As Figure 2(c) shows, the mean AUC of AICI decreases by almost $50\%$ when the action space is remapped. Advising with HRL helps to teach a teammate with a heterogeneous action space, and HMAT achieves the best performance regardless of action rotation.
7 Conclusion
The paper presents HMAT, which utilizes deep function approximations and HRL to transfer knowledge between agents in cooperative MARL problems. We propose a method to overcome the teacher credit assignment issue and show accelerated learning progress in challenging domains. Ongoing work is expanding the framework to more than two agents.
Acknowledgements
This work was supported by IBM (as part of the MIT-IBM Watson AI Lab initiative) and the AWS Machine Learning Research Awards program. Dong-Ki Kim was also supported by a Kwanjeong Educational Foundation Fellowship.
References
 [1] O. Amir, E. Kamar, A. Kolobov, and B. J. Grosz. Interactive teaching strategies for agent training. International Joint Conferences on Artificial Intelligence (IJCAI), 2016.
 [2] P. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Association for the Advancement of Artificial Intelligence (AAAI), pages 1726–1734, 2017.
 [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
 [4] J. A. Clouse. On integrating apprentice learning and reinforcement learning. 1996.
 [5] F. L. da Silva, R. Glatt, and A. H. R. Costa. Simultaneously learning and advising in multiagent reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1100–1108, 2017.
 [6] M. Daswani, P. Sunehag, and M. Hutter. Reinforcement learning with value advice. In D. Phung and H. Li, editors, Proceedings of the Sixth Asian Conference on Machine Learning, volume 39 of Proceedings of Machine Learning Research, pages 299–314. PMLR, 26–28 Nov 2015.
 [7] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.
 [8] T. G. Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. J. Artif. Int. Res., 13(1):227–303, Nov. 2000.
 [9] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multiagent policy gradients. arXiv preprint arXiv:1705.08926, 2017.
 [10] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596, 2018.
 [11] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 [12] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. CoRR, abs/1704.03003, 2017.
 [13] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
 [14] T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3682–3690, 2016.
 [15] H. M. Le, N. Jiang, A. Agarwal, M. Dudík, Y. Yue, and H. Daumé III. Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590, 2018.
 [16] H. M. Le, Y. Yue, P. Carr, and P. Lucey. Coordinated multiagent imitation learning. In International Conference on Machine Learning, pages 1995–2003, 2017.
 [17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
 [18] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6382–6393, 2017.
 [19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1928–1937. PMLR, 2016.
 [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, pages 529–533, 2015.
 [21] O. Nachum, S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. CoRR, abs/1805.08296, 2018.
 [22] F. A. Oliehoek and C. Amato. A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems. Springer, May 2016.
 [23] S. Omidshafiei, D. Kim, M. Liu, G. Tesauro, M. Riemer, C. Amato, M. Campbell, and J. P. How. Learning to teach in cooperative multiagent reinforcement learning. CoRR, abs/1805.07830, 2018.
 [24] R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, pages 1043–1049, 1998.
 [25] M. Riemer, M. Liu, and G. Tesauro. Learning abstract options. In Advances in Neural Information Processing Systems, pages 10445–10455, 2018.
 [26] E. M. Rogers. Diffusion of innovations. Simon and Schuster, 2010.
 [27] S. Ross and J. A. Bagnell. Reinforcement and imitation learning via interactive no-regret learning. CoRR, abs/1406.5979, 2014.
 [28] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15, pages 627–635. PMLR, 2011.
 [29] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.
 [30] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res., 10:1633–1685, Dec. 2009.
 [31] L. Torrey and M. Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 international conference on Autonomous agents and multiagent systems, pages 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
 [32] Y. Tsvetkov, M. Faruqui, W. Ling, B. MacWhinney, and C. Dyer. Learning the curriculum with Bayesian optimization for task-specific word representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 130–139. Association for Computational Linguistics, 2016.
 [33] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3540–3549. PMLR, 2017.
 [34] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
 [35] T. Xu, Q. Liu, L. Zhao, and J. Peng. Learning to explore with meta-policy gradient. CoRR, abs/1803.05044, 2018.
Appendix A HMAT Pseudocode
Appendix B Additional Details of Teacher Policy
B.1 Teacher Observation
While agents can simultaneously advise one another, for clarity, we detail the teacher observation when agents $i$ and $j$ are the teacher and student agent, respectively. For agent $i$’s teacher policy, its observation $\tilde{o}_{t}^{i}$ consists of:
$$\tilde{o}_{t}^{i}=\{\underbrace{{\bm{o}}_{t},{g}_{t}^{i},{g}_{t}^{ij},{Q}^{i}({\bm{o}}_{t},{g}_{t}^{i},{g}_{t}^{j}),{Q}^{i}({\bm{o}}_{t},{g}_{t}^{i},{g}_{t}^{ij})}_{\text{Teacher Knowledge}},\underbrace{{g}_{t}^{j},{Q}^{j}({\bm{o}}_{t},{g}_{t}^{i},{g}_{t}^{j}),{Q}^{j}({\bm{o}}_{t},{g}_{t}^{i},{g}_{t}^{ij})}_{\text{Student Knowledge}},\underbrace{{R}_{\text{Phase I}},{R}_{\text{Phase II}},{t}_{\text{Remain}}}_{\text{Misc.}}\},$$
where ${\bm{o}}_{t}=\{{o}_{t}^{i},{o}_{t}^{j}\}$; ${g}_{t}^{i}\sim {\pi}^{i}({o}_{t}^{i})$; ${g}_{t}^{j}\sim {\pi}^{j}({o}_{t}^{j})$; ${g}_{t}^{ij}\sim {\pi}^{i}({o}_{t}^{j})$; ${Q}^{i}$ and ${Q}^{j}$ are the centralized critics for agents $i$ and $j$, respectively; ${R}_{\text{Phase I}}$ and ${R}_{\text{Phase II}}$ are the average rewards over the last few iterations of Phase I and Phase II, respectively; and ${t}_{\text{Remain}}$ is the remaining time in the current session.
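As an illustration of how the components listed above could be packed into a single input vector, the following is a minimal sketch. The function name `build_teacher_observation`, the argument names, and the use of plain callables `q_i` and `q_j` in place of the centralized critics are assumptions made for this example; the paper does not specify the ordering or encoding.

```python
import numpy as np

def build_teacher_observation(o_i, o_j, g_i, g_j, g_ij, q_i, q_j,
                              r_phase1, r_phase2, t_remain):
    """Flatten the components of teacher i's observation into one vector.

    q_i and q_j are hypothetical callables (joint_obs, goal_i, goal_j) -> float
    that stand in for the centralized critics of agents i and j.
    """
    joint_obs = np.concatenate([np.asarray(o_i, dtype=float),
                                np.asarray(o_j, dtype=float)])
    parts = [
        # Teacher knowledge
        joint_obs, g_i, g_ij,
        q_i(joint_obs, g_i, g_j), q_i(joint_obs, g_i, g_ij),
        # Student knowledge
        g_j,
        q_j(joint_obs, g_i, g_j), q_j(joint_obs, g_i, g_ij),
        # Miscellaneous
        r_phase1, r_phase2, t_remain,
    ]
    return np.concatenate([np.atleast_1d(np.asarray(p, dtype=float)) for p in parts])
```

With three-dimensional agent observations and two-dimensional subgoals, this produces a 19-dimensional vector; the actual dimensions depend on the domain.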
B.2 Teacher Reward Function
Recall that in the advice evaluation phase of HMAT (Phase II), the batch of self-practice experiences ${\bm{E}}_{0:\overline{T}}^{\text{self-prac.}}$ reflects how the agents perform by themselves after one advising phase. The next important question is to identify an appropriate teacher reward function $\tilde{\mathcal{R}}$ that can transform the self-practice experiences into a measure of learning performance. In this section, we detail our choices of teacher reward functions, including the ones in [23] and [35], as described in Table 2.
| Teacher Reward Name | Description | Teacher Reward $\tilde{r}^{i}$ |
|---|---|---|
| VEG: Value Estimation Gain | Student’s Q-value above threshold $\tau$ [23] | $\mathbb{1}({Q}^{\text{student}}>\tau )$ |
| DR: Difference Rollout | Difference in rollout reward before/after advising [35] | $\widehat{R}-{\widehat{R}}_{\text{before}}$ |
| CR: Current Rollout | Rollout reward after advising phase | $\widehat{R}$ |
Value Estimation Gain (VEG) The value estimation gain (VEG) teacher reward function was introduced in the Learning to Coordinate and Teach Reinforcement (LeCTR) framework [23], where it performed the best. VEG rewards a teacher when its student’s Q-value exceeds a threshold $\tau $: $\mathbb{1}({Q}^{\text{student}}>\tau )$. The motivation for VEG is that the student’s Q-value is correlated with its estimated learning progress [23]. Algorithm 2 describes how teachers receive the VEG teacher reward. Note that the Q-functions in the algorithm correspond to those of the updated temporary task-level policies ${\bm{\pi}}_{\text{temp}}^{\prime}$. Threshold values of $6.0$ and $20.0$ are used for the one-box and two-box push domains, respectively.
Difference Rollout (DR) The difference rollout (DR) teacher reward function is used in the work of Xu et al. [35] for learning an exploration policy for a single agent. DR requires an additional batch of self-practice experiences collected before the advising phase, ${\bm{E}}_{0:\overline{T}}^{\text{self-prac., before}}$. The teacher reward is then the difference between the sum of rewards in ${\bm{E}}_{0:\overline{T}}^{\text{self-prac.}}$ (denoted by $\widehat{R}$) and the sum of rewards in ${\bm{E}}_{0:\overline{T}}^{\text{self-prac., before}}$ (denoted by ${\widehat{R}}_{\text{before}}$).
Current Rollout (CR) The current rollout (CR) is a simpler reward function than DR. CR returns the sum of rewards in the self-practice experiences ${\bm{E}}_{0:\overline{T}}^{\text{self-prac.}}$.
Performance Comparisons Our empirical experiments in Table 3 show that CR performs the best with HMAT. Consistent with this observation, the learning progress estimated with CR has a very high correlation with the true learning progress (see Appendix E).
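The three teacher reward functions discussed in this section can be sketched as follows. This is a minimal illustration that assumes the self-practice rewards are available as plain sequences of per-timestep team rewards; the function and argument names are hypothetical and not taken from the authors’ implementation.

```python
def veg_reward(q_student, tau):
    """Value Estimation Gain: indicator that the student's Q-value exceeds tau."""
    return 1.0 if q_student > tau else 0.0

def cr_reward(selfprac_rewards):
    """Current Rollout: sum of rewards in the self-practice batch after advising."""
    return float(sum(selfprac_rewards))

def dr_reward(selfprac_rewards, selfprac_rewards_before):
    """Difference Rollout: rollout reward after advising minus rollout reward before."""
    return cr_reward(selfprac_rewards) - cr_reward(selfprac_rewards_before)
```

One difference visible in the sketch is that DR needs a second self-practice rollout collected before advising, whereas CR reuses the single rollout already collected in Phase II.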
B.3 Teacher Experience
In this section, we detail how the next teacher observation $\tilde{o}_{0:T}^{\prime i}$ in the teacher experience $\tilde{E}_{0:T}^{i}=\langle \tilde{o}_{0:T}^{i},\tilde{g}_{0:T}^{i},\tilde{r},\tilde{o}_{0:T}^{\prime i}\rangle$ is obtained (focusing on teacher $i$ for clarity). To get the next teacher observation, the teacher’s observation $\tilde{o}_{0:T}^{i}$ is updated by replacing student $j$’s Q-values (see Appendix B.1) with the Q-values of student $j$’s updated temporary policy ${\pi}_{\text{temp}}^{\prime j}$ (i.e., representing the change in student $j$’s knowledge due to advice $\tilde{g}_{0:T}^{i}$). ${R}_{\text{Phase I}}$, ${R}_{\text{Phase II}}$, and ${t}_{\text{Remain}}$ are also updated in the next teacher observation.
| Algorithm | Hierarchical? | Teaching? | One Box Push $\overline{V}$ | One Box Push AUC | Two Box Push $\overline{V}$ | Two Box Push AUC |
|---|---|---|---|---|---|---|
| HMAT (with VEG) | ✓ | ✓ | $-10.38\pm 0.25$ | $424\pm 28$ | $-29.73\pm 2.89$ | $4694\pm 366$ |
| HMAT (with DR) | ✓ | ✓ | $-10.36\pm 0.32$ | $417\pm 31$ | $\mathbf{-28.12\pm 1.58}$ | $4758\pm 286$ |
| HMAT (with CR) | ✓ | ✓ | $\mathbf{-10.10\pm 0.19}$ | $\mathbf{458\pm 8}$ | $\mathbf{-27.49\pm 0.96}$ | $\mathbf{5032\pm 186}$ |
Appendix C Additional Experimental/Domain Details
C.1 Implementation Details
Each policy’s actor and critic are two-layer feedforward neural networks with rectified linear unit (ReLU) activations. For the task-level actor policies, a final tanh activation layer bounds the outputs between $-1$ and $1$. Actors for the worker and primitive task-level policies output two actions that correspond to x–y forces to move in the one-box and two-box domains. Similarly, actors for the manager task-level policies output two actions, but they correspond to subgoals given as (x, y) coordinates.
HMAT’s teacher-level actor policies output four actions, where the first two outputs correspond to what to advise (i.e., continuous subgoal advice given as an (x, y) coordinate) and the last two outputs correspond to when to advise (i.e., discrete actions converted to a one-hot encoding). Two separate final layers, with tanh and linear activations, are applied to the what-to-advise and when-to-advise actions, respectively. Note that the Gumbel-Softmax estimator [13] is used to compute gradients for the discrete when-to-advise actions.
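The two output heads of the teacher actor can be sketched in NumPy as follows. This only illustrates the forward sampling step of a Gumbel-Softmax relaxation; the function name `teacher_head`, the temperature value, and the input names are assumptions for this example. In an actual training setup, an automatic-differentiation framework would be needed so that gradients flow through the relaxed sample.

```python
import numpy as np

def teacher_head(pre_what, logits_when, temperature=1.0, rng=None):
    """Sketch of the teacher actor's two output heads.

    pre_what:    pre-activation values for the (x, y) subgoal advice.
    logits_when: linear outputs (logits) for the discrete when-to-advise action.
    """
    rng = np.random.default_rng() if rng is None else rng
    # "What to advise": tanh bounds the subgoal advice to [-1, 1].
    what = np.tanh(np.asarray(pre_what, dtype=float))
    # "When to advise": Gumbel-Softmax sample over the discrete choices.
    logits = np.asarray(logits_when, dtype=float)
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    y = (logits + gumbel_noise) / temperature
    y = y - y.max()                      # subtract the max for numerical stability
    soft = np.exp(y) / np.exp(y).sum()   # relaxed (differentiable) sample
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0          # one-hot action actually executed
    return what, soft, hard
```

Lower temperatures push the relaxed sample closer to a one-hot vector at the cost of higher-variance gradients, which is the usual trade-off with this estimator.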
Because our focus is teaching at the manager level, not at the worker level, we pre-train and fix the worker policies by giving them randomly generated subgoals. Per Section 2.2, the intrinsic reward function used to pre-train the worker policy is the negative squared distance between the current position and the subgoal: ${r}_{t}^{\text{intrinsic}}=-\|{o}_{t}-{g}_{t}\|_{2}^{2}$, where ${o}_{t}$ and ${g}_{t}$ correspond to the current position and subgoal, respectively. All hierarchical methods presented in this work use pre-trained workers.
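The worker’s intrinsic reward can be written as a short function. This sketch assumes the reward is the negated squared Euclidean norm of the difference between position and subgoal, as the phrase "negative distance" and the squared 2-norm notation suggest; the function name is hypothetical.

```python
import numpy as np

def intrinsic_reward(position, subgoal):
    """Negative squared Euclidean distance between the worker's position and its subgoal."""
    diff = np.asarray(position, dtype=float) - np.asarray(subgoal, dtype=float)
    return -float(np.dot(diff, diff))
```

The reward is zero exactly when the worker reaches the subgoal and becomes more negative as it moves away, which is what lets randomly generated subgoals serve as a pre-training signal.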
Amazon Elastic Compute Cloud (EC2) C5 and P3 instances are used as the computing infrastructure.
C.2 Cooperative One-Box Push Domain
• Each agent’s observation includes its position/speed, the position of the box, the position of the left target, and the other agent’s position.
• Every episode, the box and the left target initialize at ($-0.25,0.00$) and ($-0.85,0.00$), respectively. Agents initialize at random locations.
• The domain returns a team reward ${r}_{t}=-\|\text{loc(left target)}-\text{loc(box)}\|_{2}^{2}$.
• The box and the agents have radii of $0.25$ and $0.10$, respectively.
• The width and height of the domain span from $-1$ to $1$.
• The maximum number of timesteps per episode, $T$, is $50$.
• All methods use a discount factor of $\gamma =0.99$.
• The maximum number of episodes in a session, $S$, is $600$.
• Adam optimizer; actor learning rate of $0.0001$ and critic learning rate of $0.001$.
• As in TD3 [10], task-level policies for all methods use the first $2000$ timesteps for purely random exploration, followed by added Gaussian noise with a mean of $0$ and a standard deviation of $0.10$. Empirically, adding the purely random exploration yields more stable results.
• Managers in all hierarchical methods generate a subgoal every $H=5$ timesteps.
• The self-practice rollout length $\overline{T}$ is $100$ timesteps (two episodes).
• Teacher update frequency ${f}_{\text{teacher}}$ of $\{15,30,45\}$ episodes.
• Task-level batch size of $50$; teacher-level batch size of $\{16,32,64\}$.
• We focus on knowledge transfer from agent $i$ to agent $k$ in this domain. Recall that agent $i$ already understands how to move the box to the left side. Therefore, we fix the weights of agent $i$’s task-level policy, but train agent $k$’s task-level policy throughout the experiments.
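The exploration schedule in the list above (purely random actions first, then Gaussian noise on the policy output) can be sketched as below. The uniform sampling range of $[-1, 1]$ for the random phase and the final clipping to that range are assumptions made here to match the tanh-bounded actor outputs; the function and constant names are hypothetical.

```python
import numpy as np

RANDOM_STEPS = 2000   # timesteps of purely random exploration
NOISE_STD = 0.10      # standard deviation of the added Gaussian noise

def explore_action(policy_action, total_timesteps, rng):
    """Return the action executed at a given global timestep during training."""
    policy_action = np.asarray(policy_action, dtype=float)
    if total_timesteps < RANDOM_STEPS:
        # Assumed: "purely random" means uniform over the bounded action range.
        return rng.uniform(-1.0, 1.0, size=policy_action.shape)
    noise = rng.normal(0.0, NOISE_STD, size=policy_action.shape)
    # Assumed: clip so the noisy action stays within the actor's output range.
    return np.clip(policy_action + noise, -1.0, 1.0)
```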
C.3 Cooperative Two-Box Push Domain
• Each agent’s observation includes its position/speed, the positions of the boxes, the positions of the targets, and the other agent’s position.
• Every episode, the two boxes initialize at ($-0.30,0.00$) and ($0.30,0.00$), respectively. The two targets initialize at ($-0.85,0.00$) and ($0.85,0.00$), respectively. Agents reset at random locations.
• The boxes have a radius of $0.25$. The agents have a radius of $0.10$.
• The domain returns a team reward ${r}_{t}=-\left\{\|\text{loc(left target)}-\text{loc(left box)}\|_{2}^{2}+\|\text{loc(right target)}-\text{loc(right box)}\|_{2}^{2}\right\}$.
• The left box has a mass of $2$ and the right box has a mass of $6$.
• The width and height of the domain span from $-1$ to $1$.
• The maximum number of timesteps per episode, $T$, is $100$.
• All methods use a discount factor of $\gamma =0.99$.
• The maximum number of episodes in a session, $S$, is $1800$.
• Adam optimizer; actor learning rate of $0.0001$ and critic learning rate of $0.001$.
• Task-level policies for all methods use the first $2000$ timesteps for purely random exploration, followed by added Gaussian noise with a mean of $0$ and a standard deviation of $0.10$.
• Managers in all hierarchical methods generate a subgoal every $H=5$ timesteps.
• The self-practice rollout length $\overline{T}$ is $200$ timesteps (two episodes).
• Teacher update frequency ${f}_{\text{teacher}}$ of $\{15,30,45\}$ episodes.
• Task-level batch size of $50$; teacher-level batch size of $\{16,32,64\}$.
• Both agent $i$’s and agent $k$’s task-level policies are trained.
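The manager/worker timing used in both domains, where a manager emits a new subgoal every $H=5$ timesteps and the worker acts at every timestep, can be sketched as a simple loop. The callables `manager` and `worker` and the assumption that the first subgoal is issued at $t=0$ are illustrative choices, not details confirmed by the text.

```python
def hierarchical_rollout(manager, worker, observations, horizon=5):
    """Apply a manager every `horizon` steps and a worker at every step.

    manager: hypothetical callable obs -> subgoal.
    worker:  hypothetical callable (obs, subgoal) -> primitive action.
    """
    actions = []
    subgoal = None
    for t, obs in enumerate(observations):
        if t % horizon == 0:
            # The manager refreshes the subgoal at t = 0, H, 2H, ...
            subgoal = manager(obs)
        actions.append(worker(obs, subgoal))
    return actions
```

Under this timing, a 100-timestep episode in the two-box domain involves 20 manager decisions, which is the shorter effective horizon that subgoal-level advice operates on.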
Appendix D Baseline Details
D.1 LeCTR Baseline Details
For the original LeCTR framework (LeCTR–Tile), we evaluate various parameters of the tile-coding task-level policies. Specifically, we try combinations of: hash sizes of $\{1024,4096,16384\}$; numbers of tilings of $\{16,32,64,256\}$; numbers of tiles per tiling of $\{5,10,20\}$; and $\gamma $ of $\{0.90,0.95,0.99\}$.
D.2 Heuristic-Based Teaching Baseline Details
Heuristic-based teaching methods [1, 4, 5, 31] use various heuristics to decide when to advise. Once it decides to advise, a teacher agent advises with its best action at the student’s state/observation. We compare the Ask Important (AI) heuristic [1] and the Ask Important–Correct Important (AICI) heuristic [1] in our experiments. We also consider the AI and AICI methods applied in the hierarchical setting: hierarchical AI (HAI) and hierarchical AICI (HAICI). As heuristic-based teaching approaches assume tile-coding/discrete action spaces, we apply minor modifications to these approaches to combine them with MATD3 or HMATD3. In this section, we detail our modifications to AI, AICI, HAI, and HAICI.
AI and HAI In AI, a student agent $j$ asks for advice when the importance of a state ${I}^{\text{student}}$ is larger than a threshold ${\tau}^{\text{student}}$: ${I}^{\text{student}}=\max_{a}{Q}^{\text{student}}({o}^{j},a)-\min_{a}{Q}^{\text{student}}({o}^{j},a)>{\tau}^{\text{student}}$. A large ${I}^{\text{student}}$ indicates an important state, in which receiving advice can yield significantly better performance, as determined by the student agent’s Q-function. We make the following modifications to measure ${I}^{\text{student}}$ with MATD3. First, we uniformly sample $20$ actions to represent the continuous action space. Second, ${I}^{\text{student}}$ is modified to use the centralized critics. Specifically, considering the teacher agent $i$ and the student agent $j$ for clarity, the student $j$ asks for advice when ${I}^{\text{student}}=\max_{a\in \widehat{A}}{Q}^{j}({o}^{i},{o}^{j},{a}^{i},a)-\min_{a\in \widehat{A}}{Q}^{j}({o}^{i},{o}^{j},{a}^{i},a)>{\tau}^{\text{student}}$, where $\widehat{A}$ denotes the set of $20$ sampled actions and ${a}^{i}\sim {\pi}^{i}({o}^{i})$. Several ${\tau}^{\text{student}}$ values of $\{0.3,0.6,0.9\}$ are evaluated, and AI performs the best with the value of $0.6$.
HAI is similar to AI, except that it operates in the hierarchical setting. Specifically, in HAI, a teacher agent gives subgoal advice instead of primitive action advice, and larger values of $\{1.5,3.0,4.5\}$ for ${\tau}^{\text{student}}$ are evaluated to account for the manager’s subgoal generation frequency of every $H=5$ timesteps. HAI performs the best with the value of $3.0$.
AICI and HAICI As in AI, students in AICI ask for advice when ${I}^{\text{student}}$ is larger than a threshold ${\tau}^{\text{student}}$. When asked, however, teachers can decide whether or not to give advice based on ${I}^{\text{teacher}}$ and the difference between the student’s intended action and the teacher’s intended action at the student state. Specifically, when asked, teacher agent $i$ decides to advise student $j$ if ${I}^{\text{teacher}}=\max_{a}{Q}^{\text{teacher}}({o}^{j},a)-\min_{a}{Q}^{\text{teacher}}({o}^{j},a)>{\tau}^{\text{teacher}}$ and ${a}^{j}\ne {a}^{ij}$, where ${a}^{j}\sim {\pi}^{j}({o}^{j})$ and ${a}^{ij}\sim {\pi}^{i}({o}^{j})$. Similar to AI, minor modifications are made for AICI to advise MATD3 task-level policies. Student $j$ asks for advice as in AI, but teacher $i$ in AICI decides to advise when ${I}^{\text{teacher}}=\max_{a\in \widehat{A}}{Q}^{i}({o}^{i},{o}^{j},{a}^{i},a)-\min_{a\in \widehat{A}}{Q}^{i}({o}^{i},{o}^{j},{a}^{i},a)>{\tau}^{\text{teacher}}$ and $\|{a}^{j}-{a}^{ij}\|_{2}^{2}>{\tau}^{\text{action}}$, where $\widehat{A}$ denotes the set of $20$ sampled actions. Combinations of ${\tau}^{\text{student}}$ values of $\{0.3,0.6,0.9\}$ and ${\tau}^{\text{teacher}}$ values of $\{0.3,0.6,0.9\}$ are evaluated with ${\tau}^{\text{action}}$ of $0.3$. AICI performs the best with a ${\tau}^{\text{student}}$ value of $0.3$ and a ${\tau}^{\text{teacher}}$ value of $0.6$.
Similarly for HAICI, combinations of ${\tau}^{\text{student}}$ values of $\{1.5,3.0,4.5\}$ and ${\tau}^{\text{teacher}}$ values of $\{1.5,3.0,4.5\}$ are evaluated with ${\tau}^{\text{action}}$ of $0.3$. HAICI performs the best with ${\tau}^{\text{student}}$ value of $1.5$ and ${\tau}^{\text{teacher}}$ value of $3.0$.
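The modified AI and AICI decision rules can be sketched as follows. Here `q_fn` is a hypothetical callable that maps a candidate action $a$ to a critic value with all other critic inputs (observations and the other agent’s action) held fixed, and `sampled_actions` plays the role of $\widehat{A}$; these names and the default ${\tau}^{\text{action}}$ argument are assumptions for this example.

```python
import numpy as np

def state_importance(q_fn, sampled_actions):
    """Importance of a state: max minus min critic value over the sampled actions."""
    values = np.array([q_fn(a) for a in sampled_actions], dtype=float)
    return float(values.max() - values.min())

def student_asks(q_student, sampled_actions, tau_student):
    """AI rule: the student asks for advice when the state is important enough."""
    return state_importance(q_student, sampled_actions) > tau_student

def teacher_advises(q_teacher, sampled_actions, a_student, a_teacher,
                    tau_teacher, tau_action=0.3):
    """AICI rule: advise only if the state is important to the teacher and the
    student's intended action differs enough from the teacher's."""
    important = state_importance(q_teacher, sampled_actions) > tau_teacher
    diff = np.asarray(a_student, dtype=float) - np.asarray(a_teacher, dtype=float)
    return bool(important and float(np.dot(diff, diff)) > tau_action)
```

In practice, `sampled_actions` could be drawn once per decision, for example with `np.random.default_rng().uniform(-1, 1, size=(20, 2))` for the two-dimensional action spaces described in Appendix C.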
Appendix E Analysis of Teacher Reward Function
Computing the ground-truth learning progress of task-level policies often requires an expert policy and could be computationally undesirable [12]. Thus, our framework uses an estimate of the learning progress as the teacher reward. However, it is important to understand how close this estimate is to the true learning progress. The goal of teacher policies is to maximize the cumulative teacher reward, so a poor estimate of the teaching reward would result in learning undesirable teachers. In this section, we measure the differences between the true and estimated learning progress and analyze the CR teacher reward function, which performed the best.
In imitation learning, under the assumption that an expert is given, one way to measure the true learning progress is to measure the distance between the action of a learning agent and the optimal action of the expert [27, 6]. Similarly, we pre-train expert policies (HMATD3) and measure the true learning progress by the action differences. The comparison between the true learning progress and the learning progress estimated with the CR teacher reward function is shown in Figure 4. The Pearson correlation is $0.946$, which empirically shows that the CR teacher reward function estimates the learning progress well.
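For reference, the Pearson correlation between two equally long series, such as a true and an estimated learning-progress curve, can be computed as in the sketch below. How the two series are sampled and aligned is not specified in the text, so the inputs here are simply assumed to be paired measurements.

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two paired 1-D series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.dot(xc, yc) / np.sqrt(np.dot(xc, xc) * np.dot(yc, yc)))
```

The coefficient is invariant to shifting and positively rescaling either series, so it measures whether the estimate tracks the trend of the true learning progress rather than whether the two agree in absolute value.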
Appendix F Asynchronous Hierarchical Multiagent Teaching
Just as deep RL algorithms can require millions of episodes to learn a useful policy [20], our teacher policies require many sessions to learn. As one session consists of many episodes, much time might be needed until the teacher policies converge. We address this potential issue with asynchronous policy updates using multi-threading, as in asynchronous advantage actor-critic (A3C) [19]. A3C demonstrated a reduction in training time that is roughly linear in the number of threads. We also show that our HMAT variant, asynchronous HMAT, achieves a roughly linear reduction in training time as a function of the number of threads (see Figure 5).