Learning Hierarchical Teaching in Cooperative Multiagent Reinforcement Learning

  • 2019-05-30 20:13:48
  • Dong Ki Kim, Miao Liu, Shayegan Omidshafiei, Sebastian Lopez-Cot, Matthew Riemer, Golnaz Habibi, Gerald Tesauro, Sami Mourad, Murray Campbell, Jonathan P. How

Abstract

In cooperative multiagent reinforcement learning, agents commonly acquire heterogeneous knowledge. Learning across the team can be greatly improved if agents can effectively exchange their knowledge with other agents. In particular, recent work showed that action advising, a form of peer-to-peer knowledge transfer from teacher agents to student agents, improves team-wide learning. However, that prior work on action advising only considered advising with primitive (low-level) actions, which limits scalability. This paper introduces a novel learning-to-teach framework, called hierarchical multiagent teaching (HMAT), in which the teacher advice may include extended action sequences over multiple levels of temporal abstraction. The empirical evaluations show that HMAT accelerates team-wide learning progress in environments that are more complex than considered in previous learning-to-teach research. HMAT is also shown to learn teaching policies that can be transferred to different teammates/tasks, even when teammates have heterogeneous action spaces.

 

Affiliations: LIDS, MIT; MIT-IBM Watson AI Lab; IBM Research. Author note: work was completed while at MIT.
Preprint. Under review.

1 Introduction

In cooperative multiagent reinforcement learning (MARL), agents commonly develop their own knowledge of a domain since they have unique experiences. The history of human social groups provides evidence that the collective intelligence of multiagent populations may be greatly boosted if agents share their learned behaviors with others [26]. With this motivation in mind, we explore new methodologies for allowing multiagent populations of deep RL agents to effectively share their knowledge and learn from other agents while maximizing collective reward.

Recently proposed frameworks allow for various types of knowledge transfer between agents [30]. In this paper, we focus on transfer based on action advising, where an experienced “teacher” agent helps a less experienced “student” agent by suggesting which action to take next. Action advising allows a student to directly execute suggested actions without incurring much computational overhead. Recent work on action advising includes the Learning to Coordinate and Teach Reinforcement (LeCTR) framework [23], in which agents learn when and what actions to advise. While LeCTR improves upon prior advising methods based on heuristics [1, 4, 5, 31], it faces limitations in scaling to more complicated tasks with high-dimensional state-action spaces, long time horizons, and delayed rewards. The key difficulty is teacher credit assignment [23]: learning teacher policies requires estimates of the impact of each piece of advice on the student agent’s learning progress, but these estimates are difficult to obtain. A simple function approximator, such as tile coding, has been used to simplify the learning of teaching policies in LeCTR, but that approach does not scale well.

This paper proposes a new learning-to-teach framework, hierarchical multiagent teaching (HMAT). Specifically, scalability is improved by representing student policies using nonlinear function approximations (e.g., deep neural networks (DNNs)) and hierarchical reinforcement learning (HRL) [29, 14, 21], which allows advising temporally extended sequences of primitive actions. Credit assignment remains a significant issue since DNNs use mini-batches to stabilize learning [11], and mini-batches can be randomly selected from a replay memory [20]. Hence, the student’s learning progress is affected by a batch of advice suggested at varying times and identifying the extent to which each piece of advice contributes to the student’s learning is very challenging. Additional challenges include handling large state-action spaces, long time horizons, and delayed rewards.

The main contribution of this work is a method to resolve the teacher credit assignment with deep student policies: a sequence of advice (i.e., sub-goal plan), instead of single advice, is used to update the student’s policy and a temporary student policy is used to estimate the amount of teacher’s contribution to the student’s learning progress. The second contribution is learning (via HRL) high-level teacher policies that advise students to take high-level actions (e.g., sub-goals). Empirical evaluations demonstrate multiple advantages of HMAT. First, HMAT accelerates the team-wide learning progress for complex tasks compared to previous techniques such as LeCTR. This benefit results from the improved credit assignment determination that helps the teacher learning process; the representation of student policies using DNNs that can handle large state-action spaces; and the high-level advising that addresses long-horizons and delayed rewards. Agents can also learn high-level teaching policies that are transferable to different types of agents and/or tasks. Finally, agents can have different dynamics/action spaces because the high-level teaching enables knowledge transfer that is agnostic to these details.

2 Background

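We consider a cooperative MARL setting in which $n$ agents jointly interact in the environment, then receive feedback via local observations and a shared team reward. This setting can be formalized as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), defined as a tuple $\langle \mathcal{I}, \mathcal{S}, \boldsymbol{\mathcal{A}}, \mathcal{T}, \boldsymbol{\Omega}, \mathcal{O}, \mathcal{R}, \gamma \rangle$ [22]; $\mathcal{I} = \{1, \ldots, n\}$ is the set of agents, $\mathcal{S}$ is the set of states, $\boldsymbol{\mathcal{A}} = \times_{i \in \mathcal{I}} \mathcal{A}^i$ is the set of joint actions, $\mathcal{T}$ is the transition probability function, $\boldsymbol{\Omega} = \times_{i \in \mathcal{I}} \Omega^i$ is the set of joint observations, $\mathcal{O}$ is the observation probability function, $\mathcal{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. At each timestep $t$, each agent $i$ executes an action according to its policy $a^i_t \sim \pi^i(h^i_t; \theta^i)$ parameterized by $\theta^i$, where $h^i_t = (o^i_1, \ldots, o^i_t)$ is agent $i$’s observation history at timestep $t$. The joint action $\boldsymbol{a}_t = \{a^1_t, \ldots, a^n_t\}$ yields a transition from the current state $s_t \in \mathcal{S}$ to the next state $s_{t+1} \in \mathcal{S}$ with probability $\mathcal{T}(s_{t+1} \mid s_t, \boldsymbol{a}_t)$. Then, a joint observation $\boldsymbol{o}_{t+1} = \{o^1_{t+1}, \ldots, o^n_{t+1}\}$ is obtained and the team receives a shared reward $r_t = \mathcal{R}(s_t, \boldsymbol{a}_t)$. The agents’ objective is to maximize the expected cumulative reward $\mathbb{E}[\sum_t \gamma^t r_t]$. The policy parameter will often be omitted and reactive policies are considered (e.g., $\pi^i(h^i_t; \theta^i) \approx \pi^i(o^i_t)$).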
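As a small illustration of the objective above, the following sketch computes the discounted cumulative reward of a single episode from its sequence of shared team rewards. The function name and the example reward sequence are hypothetical and are not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode of shared team rewards."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Accumulate backwards so that each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode with a delayed reward at the last timestep.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 0.25
```

The expectation in the objective would be estimated by averaging this quantity over many episodes.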

2.1 Learning to Teach in Cooperative MARL

We review key concepts and notations in the learning-to-teach framework.

Task-Level Learning Problem   We consider a cooperative MARL setting with two agents $i$ and $j$ in a shared environment. At each learning iteration, agents interact in the environment, collect experiences, and update their policies, $\pi^i$ and $\pi^j$, with learning algorithms, $\mathbb{L}^i$ and $\mathbb{L}^j$. The resulting policies aim to coordinate and optimize final task performance. This problem of learning task-related policies is referred to as the task-level learning problem $\mathcal{P}_{\text{Task}}$.

Advice-Level Learning Problem   Throughout task-level learning, even without assuming that any agents are experts, agents may still develop unique skills from their experiences. As such, it is potentially beneficial for agents to advise one another using their specialized knowledge to improve the final performance and accelerate team-wide learning. The problem of learning teacher policies that decide when and what to advise is referred to as the advice-level learning problem, $\tilde{\mathcal{P}}_{\text{Advise}}$, where $\tilde{(\cdot)}$ refers to a teacher policy property.

Learning task-level policies and advice-level policies are both RL problems, interleaved within the learning-to-teach framework. However, there are important differences between $\mathcal{P}_{\text{Task}}$ and $\tilde{\mathcal{P}}_{\text{Advise}}$. One difference lies in the definition of learning episodes, since rewards about the success of advice are naturally delayed relative to typical task-level rewards. For $\mathcal{P}_{\text{Task}}$, an episode terminates either when agents arrive at a terminal state or when the timestep $t$ exceeds a pre-specified value $T$. In contrast, for $\tilde{\mathcal{P}}_{\text{Advise}}$, an episode ends when task-level policies have converged, forming one “episode” for learning teaching policies. Upon completion of the advice-level episode, task-level policies are re-initialized and training proceeds for another advice-level episode. To avoid confusion, we refer to one task-level problem episode as an episode and to one advice-level problem episode as a session (see Figure 1(a)). $\mathcal{P}_{\text{Task}}$ and $\tilde{\mathcal{P}}_{\text{Advise}}$ also have different learning objectives. Task-level learning aims to coordinate and maximize cumulative reward per episode, whereas advice-level learning aims to maximize cumulative teacher reward per session, which corresponds to accelerating team-wide learning progress (i.e., maximizing the area under the learning curve in one session). Lastly, task-level policies are inherently learned off-policy, whereas teacher policies need not be. This is because task-level policies are updated with experiences affected by teacher policies, instead of experiences generated by following agents’ task-level policies alone.

(a)
(b)
Figure 1: (a) Illustration of task-level learning progress for each session. After completing a pre-determined episode count, task-level policies are re-initialized (teaching session reset). With each new teaching session, teacher policies are better at advising students, leading to faster learning progress for task-level policies. (b) Agents teach each other according to an advising protocol. For example, knowledgeable agent i evaluates the action that agent j intends to take and advises if needed.

2.2 Hierarchical Reinforcement Learning (HRL)

HRL is a structured framework with multi-level reasoning and extended temporal abstraction [24, 29, 8, 14, 2, 33, 25, 7, 34]. HRL efficiently decomposes a complex problem into simpler sub-problems, which offers a benefit over non-HRL in solving difficult tasks with long horizons and delayed reward assignments. The HRL framework closest to the one we use in this paper is that of [21], which has a two-layer hierarchical structure: the higher-level manager policy ($\pi_{\mathcal{M}}$) and the lower-level worker policy ($\pi_{\mathcal{W}}$). The manager policy obtains an observation $o_t$ and plans a high-level sub-goal $g_t \sim \pi_{\mathcal{M}}(o_t)$ for the worker policy. The worker policy attempts to reach this sub-goal from the current state by executing a primitive action $a_t \sim \pi_{\mathcal{W}}(o_t, g_t)$ in the environment. Following this framework, an updated sub-goal is generated by the manager every $H$ timesteps and a sequence of primitive actions is executed by the worker. The manager learns to accomplish a task by optimizing the cumulative environment reward and stores an experience $\langle o_t, g_t, \sum_{t'=t}^{t+H-1} r_{t'}, o_{t+H} \rangle$ every $H$ timesteps. By contrast, the worker learns to reach the sub-goal by maximizing the cumulative intrinsic reward $r^{\text{intrinsic}}_t$ and stores an experience $\langle o_t, a_t, r^{\text{intrinsic}}_t, o_{t+1} \rangle$ at each timestep. Without loss of generality, we also denote the next observation with the prime symbol ($o'$).
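The two-layer loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `hierarchical_rollout`, the callables passed to it, and the 1-D toy environment with a negative-distance intrinsic reward are all hypothetical stand-ins.

```python
def hierarchical_rollout(env_step, manager, worker, intrinsic_reward, o0, T, H):
    """Roll out T timesteps; the manager re-plans a sub-goal every H timesteps."""
    manager_buffer, worker_buffer = [], []
    o = o0
    for t in range(T):
        if t % H == 0:
            # New sub-goal; remember where this manager segment started.
            g, seg_start, seg_reward = manager(o), o, 0.0
        a = worker(o, g)
        o_next, r = env_step(o, a)
        # Worker experience: (o_t, a_t, intrinsic reward, o_{t+1}).
        worker_buffer.append((o, a, intrinsic_reward(o_next, g), o_next))
        seg_reward += r
        if (t + 1) % H == 0 or t == T - 1:
            # Manager experience: (o_t, g_t, summed env reward, o_{t+H}).
            manager_buffer.append((seg_start, g, seg_reward, o_next))
        o = o_next
    return manager_buffer, worker_buffer


# Hypothetical 1-D toy problem: positions are integers, the task goal is 4.
def toy_env_step(o, a):
    return o + a, -abs(o + a - 4)

def toy_manager(o):
    return o + 2  # always ask the worker to move two cells to the right

def toy_worker(o, g):
    return 1 if g > o else (-1 if g < o else 0)

def toy_intrinsic(o_next, g):
    return -abs(o_next - g)

managers, workers = hierarchical_rollout(
    toy_env_step, toy_manager, toy_worker, toy_intrinsic, o0=0, T=4, H=2)
print(managers)
print(len(workers))
```

With `T=4` and `H=2`, the manager stores two experiences while the worker stores four, which mirrors the different storage frequencies of the two levels.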

3 Overview of Hierarchical Multiagent Teaching (HMAT)

3.1 Deep hierarchical Task-Level Policy

HMAT addresses the limited scalability of LeCTR by using deep function approximators and learns high-level teacher policies to decide what high-level actions (e.g., sub-goals) to advise fellow agents and when that advice should be given. Specifically, to extend task-level policies with DNNs and hierarchical representations, we replace $\pi^i$ and $\pi^j$ with deep hierarchical policies consisting of manager policies, $\pi^i_{\mathcal{M}}$ and $\pi^j_{\mathcal{M}}$, and worker policies, $\pi^i_{\mathcal{W}}$ and $\pi^j_{\mathcal{W}}$ (see Figure 1(b)). The manager and worker policies in HMAT are trained with different objectives. Managers learn to accomplish a task together (i.e., solving $\mathcal{P}_{\text{Task}}$) by optimizing cumulative reward, while workers are trained to reach sub-goals suggested by their managers (see Appendix C for details).

In this paper, we focus on transferring knowledge at the manager level instead of the worker level, since manager policies represent abstract knowledge, which is more relevant to fellow agents. Such a consideration also allows managers to transfer knowledge even if workers have dynamics, action spaces, or observation spaces unique to each agent. Therefore, hereafter when we discuss task-level policies, it is implied that we are only discussing the manager policy. The manager subscript will often be omitted when discussing these task-level policies (i.e., we write $\pi^i$ for $\pi^i_{\mathcal{M}}$) to simplify notation.

3.2 Advice-Level Learning in Hierarchical Settings

We consider the problem of sharing knowledge between managers via teacher policies, $\tilde{\pi}^i$ and $\tilde{\pi}^j$, which learn when and what sub-goal actions to advise. Consider Figure 1(b), where agents are learning to coordinate with hierarchical task-level policies (solving $\mathcal{P}_{\text{Task}}$) while advising each other via teacher policies (solving $\tilde{\mathcal{P}}_{\text{Advise}}$). There are two roles in Figure 1(b): that of a student agent $j$ (i.e., an agent whose manager policy receives advice) and that of a teacher agent $i$ (i.e., an agent whose teacher policy gives advice). Note that agents $i$ and $j$ can simultaneously teach each other, but, for clarity, Figure 1(b) only shows a one-way interaction. Here, student $j$ has decided that it is appropriate to strive for sub-goal $g^j$ by querying its manager policy. Before $j$ passes the sub-goal $g^j$ to its worker, $i$’s teaching policy checks $j$’s intended sub-goal and decides whether to advise or not. Having decided to advise, $i$ transforms its task-level knowledge into desirable sub-goal advice via its teacher policy and suggests it to $j$. After student $j$ accepts the advice from the teacher, the updated sub-goal $g^j$ is passed to $j$’s worker policy, which then generates a primitive action $a^j$.
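The one-way advising interaction described above can be sketched as a single function. This is an illustrative sketch only: `advised_subgoal`, the callable signatures, and the threshold-based toy teacher are hypothetical, and in HMAT the teacher's decision is a learned policy rather than a hand-written rule.

```python
def advised_subgoal(student_manager, teacher_policy, student_obs, teacher_obs):
    """Return (sub-goal passed to the student's worker, whether advice was given)."""
    # Student j queries its own manager policy for an intended sub-goal.
    intended = student_manager(student_obs)
    # Teacher i inspects the intended sub-goal and decides whether to advise.
    give_advice, advice = teacher_policy(teacher_obs, intended)
    if give_advice:
        return advice, True   # the student accepts the advised sub-goal
    return intended, False    # no advice: keep the intended sub-goal


# Hypothetical 1-D example: the teacher advises only when the student's
# intended sub-goal is far from the sub-goal the teacher itself prefers.
def toy_student_manager(obs):
    return obs - 1.0

def toy_teacher_policy(teacher_obs, intended, preferred=3.0, tolerance=0.5):
    return abs(intended - preferred) > tolerance, preferred

print(advised_subgoal(toy_student_manager, toy_teacher_policy, 0.0, None))
print(advised_subgoal(toy_student_manager, toy_teacher_policy, 4.0, None))
```

The returned sub-goal is what would be handed to the student's worker policy to produce a primitive action.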

4 Details of HMAT

4.1 Algorithm of Hierarchical Multiagent Teaching

HMAT iterates over the following three phases to learn how to coordinate with deep hierarchical task-level policies (solving $\mathcal{P}_{\text{Task}}$) and how to provide advice using the teacher policies (solving $\tilde{\mathcal{P}}_{\text{Advise}}$). These phases are designed to address the teacher credit assignment issue with deep task-level policies. As discussed, identifying which portions of the advice led to successful student learning is difficult (see Section 1). That issue is addressed by adapting ideas developed for learning an exploration policy for a single agent [35], which include an extended view of actions (see below) and the use of a temporary policy for measuring a reward for the exploration policy. We adapt and extend these ideas from learning-to-explore in a single-agent setting to learning-to-teach in a multiagent setting. Pseudocode of HMAT is presented in Appendix A.

Phase I (Advising Phase)   Agents advise each other using their teacher policies according to the advising protocol during one episode. This process generates a batch of task-level experiences influenced by the teaching policy’s behavior. Specifically, we extend the concept of teacher policies by providing advice in the form of a sequence of multiple sub-goals, $\tilde{\boldsymbol{g}}_{0:T} = \{\tilde{g}^i_{0:T}, \tilde{g}^j_{0:T}\}$ (where $\tilde{g}^i_{0:T} = (\tilde{g}^i_0, \tilde{g}^i_H, \ldots, \tilde{g}^i_T)$ denotes multiple sub-goals in one episode), instead of just providing one piece of advice $\tilde{\boldsymbol{g}}_t = \{\tilde{g}^i_t, \tilde{g}^j_t\}$ before updating task-level policies. One teacher action in this extended view corresponds to providing multiple pieces of advice during one episode, which contrasts with previous approaches to teaching in MARL that were updated based on a single piece of advice [1, 4, 5, 23, 31]. This extension is important because, by following advice $\tilde{\boldsymbol{g}}_{0:T}$ in an episode, a batch of task-level experiences for agents $i$ and $j$, $\boldsymbol{E}^{\text{advice}}_{0:T} = \{\langle o^i_{0:T}, \tilde{g}^i_{0:T}, r_{0:T}, o'^i_{0:T} \rangle, \langle o^j_{0:T}, \tilde{g}^j_{0:T}, r_{0:T}, o'^j_{0:T} \rangle\}$, is generated to enable stable mini-batch updates of the DNN policies.

Phase II (Advice Evaluation Phase)   Learning teacher policies requires reward feedback from the advice in Phase I. Phase II evaluates and estimates the impact the advice had on improving team-wide learning progress, yielding the teacher policy’s reward. A temporary task-level policy is used to estimate the teacher reward for $\tilde{\boldsymbol{g}}_{0:T}$, so agents copy their current task-level policies to temporary policies ($\boldsymbol{\pi}_{\text{temp}} \leftarrow \boldsymbol{\pi}$). To determine the teacher reward for $\tilde{\boldsymbol{g}}_{0:T}$, $\boldsymbol{\pi}_{\text{temp}}$ are updated for a small number of iterations using only $\boldsymbol{E}^{\text{advice}}_{0:T}$ (i.e., $\boldsymbol{\pi}_{\text{temp}} \leftarrow \mathbb{L}(\boldsymbol{\pi}_{\text{temp}}, \boldsymbol{E}^{\text{advice}}_{0:T})$). The updated temporary policies generate a batch of self-practice experiences $\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}}$ by rolling out a total of $\bar{T}$ timesteps without involving teacher policies. These self-practice experiences, which are based on $\boldsymbol{\pi}_{\text{temp}}$, reflect how agents (on their own) would perform after the advising phase and can be used to estimate the impact of $\tilde{\boldsymbol{g}}_{0:T}$ on the team-wide learning. The function $\tilde{\mathcal{R}}$ (Section 4.2) uses the self-practice experiences to compute the teacher reward for $\tilde{\boldsymbol{g}}_{0:T}$ (i.e., $\tilde{r} = \tilde{\mathcal{R}}(\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}})$). A key point is that the temporary policies $\boldsymbol{\pi}_{\text{temp}}$ used to compute $\tilde{r}$ are only updated based on $\boldsymbol{E}^{\text{advice}}_{0:T}$ (i.e., experiences from the past iterations are not utilized), which resolves the teacher credit assignment issue.

Phase III (Policy Update Phase)   Task-level policies are updated to solve $\mathcal{P}_{\text{Task}}$ by using the learning algorithms $\mathbb{L}^i$ and $\mathbb{L}^j$, and advice-level policies are updated to solve $\tilde{\mathcal{P}}_{\text{Advise}}$ by using the learning algorithms $\tilde{\mathbb{L}}^i$ and $\tilde{\mathbb{L}}^j$. In particular, task-level policies, $\pi^i$ and $\pi^j$, are updated for the next iteration by randomly sampling experiences from task-level experience memories, $\mathcal{D}^i$ and $\mathcal{D}^j$. As in [35], both $\boldsymbol{E}^{\text{advice}}_{0:T}$ and $\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}}$ are added to the task-level memories. Similarly, after adding the teacher experience collected from the advising (Phase I) and advice evaluation (Phase II) phases to teacher experience memories, $\tilde{\mathcal{D}}^i$ and $\tilde{\mathcal{D}}^j$, teacher policies are updated by randomly selecting samples from their replay memories, but at a slower update frequency than that of task-level policies.
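The data flow through the three phases can be sketched as below. This is a hedged skeleton rather than the pseudocode of the paper: `hmat_iteration` and every callable passed to it are hypothetical placeholders, the experience tuples use an assumed `(obs, sub-goal, reward, next_obs)` layout, and the actual mini-batch updates of the real task-level and teacher policies are left to the caller.

```python
import copy


def hmat_iteration(policies, teachers, task_memory, teacher_memory,
                   advise_episode, task_update, self_practice, teacher_reward):
    """One pass over Phases I-III; returns the estimated teacher reward."""
    # Phase I: one episode in which teachers may replace intended sub-goals.
    advice_batch, teacher_obs, teacher_actions = advise_episode(policies, teachers)

    # Phase II: copy the task-level policies, update the copy with the advice
    # batch only, then roll it out without teachers to score the advice.
    temp_policies = task_update(copy.deepcopy(policies), advice_batch)
    practice_batch, next_teacher_obs = self_practice(temp_policies)
    r_tilde = teacher_reward(practice_batch)

    # Phase III (storage part): both batches go to the task-level memory and
    # one extended teacher experience goes to the teacher memory.
    task_memory.extend(advice_batch)
    task_memory.extend(practice_batch)
    teacher_memory.append((teacher_obs, teacher_actions, r_tilde, next_teacher_obs))
    return r_tilde


# Toy stand-ins (hypothetical): a "policy" is a dict holding one skill number.
policies = {"skill": 0.0}
task_memory, teacher_memory = [], []
r = hmat_iteration(
    policies, teachers=None,
    task_memory=task_memory, teacher_memory=teacher_memory,
    advise_episode=lambda p, t: ([("o0", "g0", 1.0, "o1"), ("o1", "g1", 1.0, "o2")],
                                 ["to0", "to1"], ["g0", "g1"]),
    task_update=lambda p, batch: {"skill": p["skill"] + 0.5 * len(batch)},
    self_practice=lambda p: ([("o0", "g", p["skill"], "o1")], ["to0'", "to1'"]),
    teacher_reward=lambda batch: sum(e[2] for e in batch),
)
print(r, policies["skill"], len(task_memory), len(teacher_memory))
```

Note that the real policies are untouched by Phase II: only the temporary copy is updated, so the teacher reward depends on the current advice batch alone.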

4.2 Details of Teacher Policy

We explain important components of the teacher policy (focusing on teacher $i$ for clarity). See Appendix B for additional details.

Teacher Observation and Action   Teacher-level observations $\tilde{\boldsymbol{o}}_t = \{\tilde{o}^i_t, \tilde{o}^j_t\}$ compactly provide information about the nature of the heterogeneous knowledge between the two agents. Given $\tilde{o}^i_t$, teacher $i$ decides when and what to advise, with one action for deciding whether or not it should provide advice and another action for selecting the sub-goal to give as advice. If no advice is provided, student $j$ executes its originally intended sub-goal.

Teacher Reward Function   Teacher policies aim to maximize cumulative teacher reward in a session, which should result in faster team-wide learning progress. Recall from Phase II that the batch of self-practice experiences reflects how agents perform by themselves after one advising phase. The open question is how to identify an appropriate teacher reward function $\tilde{\mathcal{R}}$ that can map self-practice experiences into better learning performance. Intuitively, maximizing the reward returned by a teacher reward function $\tilde{r} = \tilde{\mathcal{R}}(\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}})$ means that teachers should advise so that the learning performance is maximized after one advising phase. In this work, we consider a new reward function, called current rollout (CR), which returns the sum of rewards in the self-practice experiences $\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}}$. We also evaluate different choices of teacher reward functions, including the ones in [23] and [35].
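The current rollout (CR) reward described above is a plain sum, which the following sketch makes concrete. The function name and the `(obs, sub-goal, reward, next_obs)` tuple layout are assumptions made for illustration; the text only states that CR returns the sum of rewards in the self-practice experiences.

```python
def current_rollout_reward(self_practice_batch):
    """CR teacher reward: sum of task rewards in the self-practice experiences."""
    return sum(reward for (_obs, _subgoal, reward, _next_obs) in self_practice_batch)


# Hypothetical self-practice batch of three experiences.
batch = [("o0", "g0", -1.0, "o1"), ("o1", "g1", -0.5, "o2"), ("o2", "g2", 0.0, "o3")]
print(current_rollout_reward(batch))  # -1.5
```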

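Teacher Experiences   One teacher experience corresponds to $\tilde{E}^i_{0:T} = \langle \tilde{o}^i_{0:T}, \tilde{g}^i_{0:T}, \tilde{r}, \tilde{o}'^i_{0:T} \rangle$, where $\tilde{o}^i_{0:T} = (\tilde{o}^i_0, \tilde{o}^i_H, \tilde{o}^i_{2H}, \ldots, \tilde{o}^i_T)$ is a teacher observation; $\tilde{g}^i_{0:T} = (\tilde{g}^i_0, \tilde{g}^i_H, \tilde{g}^i_{2H}, \ldots, \tilde{g}^i_T)$ is a teacher action; $\tilde{r}$ is an estimated teacher reward computed with $\tilde{\mathcal{R}}$; and $\tilde{o}'^i_{0:T} = (\tilde{o}'^i_0, \tilde{o}'^i_H, \tilde{o}'^i_{2H}, \ldots, \tilde{o}'^i_T)$ is a next teacher observation, obtained by updating $\tilde{o}^i_{0:T}$ with the updated temporary policy $\pi^j_{\text{temp}}$ (i.e., representing the change in student $j$’s knowledge due to advice $\tilde{g}^i_{0:T}$).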

4.3 Training Protocol

Task-Level Training   To accommodate the inherently off-policy nature of task-level learning (Section 2.1), we use the off-policy TD3 [10] algorithm to train the worker and manager policies. TD3 is an actor-critic algorithm that introduces two critics, $Q_1$ and $Q_2$, to reduce the overestimation of Q-values in DDPG [17], yielding more robust learning performance. TD3 was originally a single-agent deep RL algorithm for continuous states/actions. Here, we extend TD3 to multiagent settings, terming the resulting algorithm MATD3, and address non-stationarity in MARL by applying centralized critics with decentralized actors [9, 18]. A further algorithm, HMATD3, extends MATD3 with HRL. In HMATD3, agent $i$’s task policy critics, $Q^i_1$ and $Q^i_2$, minimize the following critic loss:

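$$\mathcal{L} = \sum_{\alpha=1}^{2} \mathbb{E}_{\langle \mathbf{o}, \mathbf{g}, r, \mathbf{o}' \rangle \sim \mathcal{D}^i} \big[ y - Q^i_{\alpha}(\mathbf{o}, \mathbf{g}) \big]^2, \quad \text{s.t.} \quad y = r + \gamma \min_{\beta=1,2} Q^i_{\beta,\text{target}}\big(\mathbf{o}', \pi_{\text{target}}(\mathbf{o}') + \epsilon\big), \tag{1}$$

where $\mathbf{o} = \{o^i, o^j\}$; $\mathbf{g} = \{g^i, g^j\}$; $\mathbf{o}' = \{o'^i, o'^j\}$; $\pi_{\text{target}} = \{\pi^i_{\text{target}}, \pi^j_{\text{target}}\}$; the subscript “target” denotes the target network; and $\epsilon \sim \mathcal{N}(0, \sigma)$. Agent $i$’s actor policy $\pi^i$ with parameter $\theta^i$ is updated by:

$$\nabla_{\theta^i} J(\theta^i) = \mathbb{E}_{\mathbf{o} \sim \mathcal{D}^i} \Big[ \nabla_{\theta^i} \pi^i(g^i \mid o^i) \, \nabla_{g^i} Q^i_1(\mathbf{o}, \mathbf{g}) \big|_{\mathbf{g} = \pi(\mathbf{o})} \Big]. \tag{2}$$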
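To make the critic target in Equation (1) concrete, the sketch below evaluates $y$ and the summed squared error for a single sampled experience. It is illustrative only: the function names, the representation of sub-goals as lists of floats, and the interpretation of $\sigma$ as a standard deviation are assumptions, and the noise is not clipped because Equation (1) does not show clipping.

```python
import random


def td3_target(r, next_obs, gamma, target_actor, target_q1, target_q2,
               sigma=0.1, rng=random):
    """y = r + gamma * min_beta Q_beta_target(o', pi_target(o') + eps)."""
    # Target policy smoothing: perturb each target sub-goal with Gaussian noise.
    noisy_goal = [g + rng.gauss(0.0, sigma) for g in target_actor(next_obs)]
    # Clipped double-Q: take the smaller of the two target critics.
    q_min = min(target_q1(next_obs, noisy_goal), target_q2(next_obs, noisy_goal))
    return r + gamma * q_min


def critic_loss_one_sample(y, q1_value, q2_value):
    """Per-sample version of the loss: sum over both critics of (y - Q_alpha)^2."""
    return (y - q1_value) ** 2 + (y - q2_value) ** 2


# Hypothetical stand-ins for the target networks (joint sub-goal as a list).
def toy_target_actor(obs):
    return [1.0, 2.0]

def toy_q1(obs, goal):
    return sum(goal)

def toy_q2(obs, goal):
    return sum(goal) - 1.0

y = td3_target(r=1.0, next_obs=None, gamma=0.5, target_actor=toy_target_actor,
               target_q1=toy_q1, target_q2=toy_q2, sigma=0.0)
print(y, critic_loss_one_sample(y, q1_value=2.5, q2_value=1.5))
```

Setting `sigma=0.0` disables the smoothing noise so that the example is deterministic; in training the expectation would be approximated by averaging the per-sample loss over a mini-batch drawn from the replay memory.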

Advice-Level Training   TD3 is also used for updating teacher policies. We modify Equations (1) and (2) to account for the teacher’s extended view. Considering agent $i$ for clarity, agent $i$’s teacher policy critics, $\tilde{Q}^i_1$ and $\tilde{Q}^i_2$, minimize the following critic loss:

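$$\tilde{\mathcal{L}} = \sum_{\alpha=1}^{2} \mathbb{E}_{\tilde{\boldsymbol{E}}_{0:T} \sim \tilde{\mathcal{D}}^i} \Big[ \mathbb{E}_{\tilde{\boldsymbol{E}} \sim \tilde{\boldsymbol{E}}_{0:T}} \big( y - \tilde{Q}^i_{\alpha}(\tilde{\mathbf{o}}, \tilde{\mathbf{g}}) \big) \Big]^2, \tag{3}$$
$$\text{s.t.} \quad y = \tilde{r} + \gamma \min_{\beta=1,2} \tilde{Q}^i_{\beta,\text{target}}\big(\tilde{\mathbf{o}}', \tilde{\pi}_{\text{target}}(\tilde{\mathbf{o}}') + \epsilon\big), \quad \tilde{\boldsymbol{E}}_{0:T} = \langle \tilde{\mathbf{o}}_{0:T}, \tilde{\mathbf{g}}_{0:T}, \tilde{r}, \tilde{\mathbf{o}}'_{0:T} \rangle, \quad \tilde{\boldsymbol{E}} = \langle \tilde{\mathbf{o}}, \tilde{\mathbf{g}}, \tilde{r}, \tilde{\mathbf{o}}' \rangle,$$

where $\tilde{\mathbf{o}} = \{\tilde{o}^i, \tilde{o}^j\}$; $\tilde{\mathbf{g}} = \{\tilde{g}^i, \tilde{g}^j\}$; $\tilde{\mathbf{o}}' = \{\tilde{o}'^i, \tilde{o}'^j\}$; $\tilde{\pi}_{\text{target}} = \{\tilde{\pi}^i_{\text{target}}, \tilde{\pi}^j_{\text{target}}\}$. Agent $i$’s actor policy $\tilde{\pi}^i$ with parameter $\tilde{\theta}^i$ is updated by:

$$\nabla_{\tilde{\theta}^i} J(\tilde{\theta}^i) = \mathbb{E}_{\tilde{\mathbf{o}}_{0:T} \sim \tilde{\mathcal{D}}^i} \Big[ \mathbb{E}_{\tilde{\mathbf{o}} \sim \tilde{\mathbf{o}}_{0:T}} (z) \Big], \tag{4}$$

where $z = \nabla_{\tilde{\theta}^i} \tilde{\pi}^i(\tilde{g}^i \mid \tilde{o}^i) \, \nabla_{\tilde{g}^i} \tilde{Q}^i_1(\tilde{\mathbf{o}}, \tilde{\mathbf{g}}) \big|_{\tilde{\mathbf{g}} = \tilde{\pi}(\tilde{\mathbf{o}})}$.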

5 Related Work

Imitation learning studies how to learn a policy from expert demonstrations [6, 28, 27]. Recent work applied imitation learning for multiagent coordination [16] and explored effective combinations of imitation learning and hierarchical RL [15]. Curriculum learning [3, 12, 32], which progressively increases task difficulty, is also relevant. While most work on this topic focuses on learning knowledge for solving single-agent problems, we study peer-to-peer knowledge transfer in cooperative MARL. Related work by Xu et al. [35] learns an exploration policy, which relates to our approach to learning teaching policies. Their approach led to unstable learning in our setting, motivating our policy update rule (Section 4.3). We also consider various teacher reward functions, including the one in [35], and find that the new CR reward function performs better empirically in our domains.

6 Evaluation

We demonstrate HMAT’s performance in increasingly challenging domains that involve continuous states/actions, long horizons, and delayed rewards.

(a) Cooperative one box domain.
(b) Cooperative two box domain.
Figure 2: Two scenarios used for evaluating the learning-to-teach framework.

6.1 Evaluation Domains and Tasks

Our domains are based on OpenAI’s multiagent particle environment, which supports continuous observation/action spaces. We modify the environment and propose new domains, called cooperative one/two box push, that require coordinating agents (see Appendix C for additional details).

Cooperative One Box Push (COBP)   The domain consists of one round box and two agents (Figure 2(a)). The objective is to move the box to the target on the left side as soon as possible. The box can be moved if and only if two agents act on it together. This unique property requires that the agents coordinate. The domain has a delayed reward because there is no change in reward until the box is moved by the two agents. An episode in COBP ends when $t$ exceeds 50 timesteps.

Cooperative Two Box Push (CTBP)   This domain is similar to the one box domain but with increased complexity. There are two round boxes in the domain (Figure 2(b)). The objective is to move the left box (box1) to the left target (target1) and the right box (box2) to the right target (target2). In addition, the boxes have different mass – box2 is 3x heavier than box1. An episode in CTBP ends when $t$ exceeds 100 timesteps.

Heterogeneous Knowledge   For each domain, we provide each agent with a different set of priors to ensure heterogeneous knowledge between them and motivate interesting teaching scenarios. For the COBP (Figure 2(a)), agents $i$ and $j$ are first trained to move the box to the target. Then, agents $i$ and $k$ are teamed up, and agent $k$ has no knowledge about the domain. Agent $i$, which understands how to move the box, should teach agent $k$ by giving good advice to improve $k$’s learning progress. For the CTBP (Figure 2(b)), agents $i$ and $j$ have received prior training about how to move box1 to target1, and agents $k$ and $l$ understand how to move box2 to target2. However, these two teams have different skills, as the tasks involve moving boxes with different weights (light vs heavy) and in different directions (left vs right). Then agents $i$ and $k$ are teamed up, and in this scenario, agent $i$ should transfer its knowledge about moving box1 to agent $k$. Meanwhile, agent $k$ should teach agent $i$ how to move box2, so that there is a two-way transfer of knowledge where each agent is the primary teacher at one point and the primary student at another.

Table 1: V̄ and AUC for different algorithms. Results show a mean and standard deviation computed for 10 sessions. Best results (bold in the original; computed via a t-test with p<0.05) are marked with an asterisk.
Algorithm   Hierarchical?  Teaching?  One Box Push V̄   One Box Push AUC  Two Box Push V̄   Two Box Push AUC
MATD3       No             No         -12.64 ± 3.42     148 ± 105         -44.21 ± 5.01     1833 ± 400
LeCTR–Tile  No             Yes        -18.00 ± 0.01     6 ± 0             -60.50 ± 0.05     246 ± 3
LeCTR–D     No             Yes        -14.85 ± 2.67     81 ± 80           -46.07 ± 3.00     1739 ± 223
LeCTR–OD    No             Yes        -17.92 ± 0.10     17 ± 3            -60.43 ± 0.08     272 ± 26
AI          No             Yes        -11.33 ± 2.46*    162 ± 86          -41.09 ± 3.50     2296 ± 393
AICI        No             Yes        -10.60 ± 0.85*    200 ± 74          -38.23 ± 0.45     2742 ± 317
MAT         No             Yes        -10.04 ± 0.38*    274 ± 38          -39.49 ± 2.93     2608 ± 256
HMATD3      Yes            No         -10.24 ± 0.20*    427 ± 18          -31.55 ± 3.51     4288 ± 335
LeCTR–HD    Yes            Yes        -12.10 ± 0.94     265 ± 59          -38.87 ± 2.94     2820 ± 404
LeCTR–OHD   Yes            Yes        -16.77 ± 0.66     70 ± 22           -60.23 ± 0.33     306 ± 28
HAI         Yes            Yes        -10.23 ± 0.19*    427 ± 9           -31.37 ± 3.71     4400 ± 444
HAICI       Yes            Yes        -10.25 ± 0.26*    433 ± 5           -29.32 ± 1.19     4691 ± 261
HMAT        Yes            Yes        -10.10 ± 0.19*    458 ± 8*          -27.49 ± 0.96*    5032 ± 186*

6.2 Baselines

We compare to several baselines in order to provide context for the performance of HMAT (see Appendix D for additional details).

No-Teaching Baselines:   MATD3 and HMATD3 are the baselines for a primitive and hierarchical MARL without teaching, respectively.

LeCTR Baselines:   This includes the LeCTR framework [23] with task-level policies that use tile-coding (LeCTR–Tile). We also consider modifications of that framework in which the task-level policies are learned with deep RL methods: MATD3 (LeCTR–D) and HMATD3 (LeCTR–HD). Lastly, we compare two LeCTR baselines using online MATD3 (LeCTR–OD) and online HMATD3 (LeCTR–OHD), where the online update denotes the policy update with the most recent experience.

Heuristic Teaching Baselines:   Two heuristic-based primitive teaching baselines, Ask Important (AI) and Ask Important–Correct Important (AICI) [1], are compared. In AI, each student asks for advice based on the importance of a state using the student’s Q-values. When asked, the teacher agent always advises with its best primitive action at a given student state. Students in AICI also ask for advice, but each teacher can decide whether to advise with its best action (or not to advise). The teacher decides based on the state importance using the teacher’s Q-values and the difference between the student’s intended action and the teacher’s intended action at a student state. AICI is one of the best performing heuristic algorithms in [1]. Hierarchical AI (HAI) and hierarchical AICI (HAICI) are similar to AI and AICI, but teach in the hierarchical setting (i.e., managers teach each other).

HMAT Variants:   A non-hierarchical variant of HMAT, called MAT, is compared under different choices of teacher reward functions. For clarity, we only show results with the CR reward function. Performance comparisons between different reward functions are shown in Appendix B.

(a)
(b)
(c)
Figure 3: (a) and (b) Task-level learning progress in the one box push domain and two box push domain, respectively. The oracles in (a) and (b) refer to the performance of converged HMATD3. For fair comparisons, HMAT and MAT include the episodes used in both Phase I and Phase II when counting the number of training episodes. (c) Heterogeneous action AUC based on the action rotation. Mean and 95% confidence interval computed for 10 sessions are shown in all figures.

6.3 Results on One Box and Two Box Push

Table 1 compares HMAT and its baselines. The results show both final task-level performance (V̄) and area under the task-level learning curve (AUC); higher values are better for both metrics. The results demonstrate improved task-level learning performance with HMAT compared to HMATD3 and with MAT compared to MATD3, as indicated by the higher final performance (V̄) and larger rate of learning (AUC) in Figures 3(a) and 3(b). HMAT also shows better performance than MAT. These results demonstrate the benefits of the high-level advising, which helps address the long time horizons and delayed rewards in these two domains.
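For readers who want to reproduce an AUC-style metric, one common choice is the trapezoidal area between the learning curve and a reference level. The sketch below is an assumption, not the paper's definition: the source does not state how its AUC values are computed or normalized, and the function name, the unit spacing between evaluations, and the `reference` parameter are hypothetical.

```python
def learning_curve_auc(returns, reference=0.0):
    """Trapezoidal area between a learning curve and a reference level.

    `returns` holds one evaluation return per equally spaced training
    checkpoint (spacing taken as 1); `reference` is the level that counts
    as zero area, e.g. the return of an untrained policy.
    """
    shifted = [v - reference for v in returns]
    return sum((a + b) / 2.0 for a, b in zip(shifted, shifted[1:]))


# Hypothetical learning curve that improves from -3 to -1.
curve = [-3.0, -2.0, -1.0]
print(learning_curve_auc(curve, reference=-3.0))  # 2.0
```

A curve that rises earlier encloses more area above the reference, which is why such a metric rewards faster learning and not only final performance.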

HMAT also achieves better performance than the LeCTR baselines. LeCTR–Tile shows the smallest $\bar{V}$ and AUC due to the limitations of the tile-coding representation of the policies in these complex domains. Teacher policies in LeCTR–D and LeCTR–HD have poor estimates of the teacher credit assignment with deep task-level policies, which results in unstable learning of the teaching policies and worse performance than the no-teaching baselines (i.e., LeCTR–D vs. MATD3 and LeCTR–HD vs. HMATD3). In contrast, both LeCTR–OD and LeCTR–OHD have good estimates of the teacher credit assignment because the task-level policies are updated online. However, these two approaches suffer from instability caused by the absence of mini-batch updates for the DNN policies. Finally, HMAT attains the best performance in terms of $\bar{V}$ and AUC compared to the heuristic-based baselines.

These combined results demonstrate the key advantage of HMAT in that it can accelerate the learning progress for complex tasks with continuous states/actions, long horizons, and delayed rewards.

6.4 Results on Transferability and Heterogeneous Action Spaces

HMAT can learn high-level teaching strategies that are transferable to different types of agents and/or tasks. We evaluate transferability from the following two perspectives, as well as teaching with heterogeneous action spaces. The numerical results below are the mean and standard deviation computed over 10 sessions.

Transfer across Different Student Types   We first create a small population of students, each with different knowledge. Specifically, we create 7 students that can push one box to distinct areas in the one-box push domain: top-left, top, top-right, right, bottom-right, bottom, and bottom-left. This population is divided into train {top-left, top, bottom-right, bottom, and bottom-left}, validation {top-right}, and test {right} groups. After the teacher policy has converged, we fix the policy and transfer it to a different setting in which the teacher advises a student in the test group. Although the teacher has never interacted with the student in the test group before, it achieves an AUC of 400±10, compared to the no-teaching baseline (HMATD3) AUC of 369±27.

Transfer across Different Tasks   We first train a teacher in the one-box push domain to transfer knowledge to agent k about how to move the box to the left. We then fix the converged teacher policy and evaluate it on a different task: moving the box to the right. While task-level learning without teaching achieves an AUC of 363±43, task-level learning with teaching achieves an AUC of 414±11. Thus, learning is faster even when using teacher policies pre-trained on a different task.

Teaching with Heterogeneous Action Spaces   We consider heterogeneous action space variants in the one-box push domain, where agent k has a remapped primitive action space (e.g., 180° rotation) compared to its teammate i. The original heuristic-based algorithms advise with primitive actions and thus assume action space homogeneity. As Figure 3(c) shows, the mean AUC of AICI decreases by almost 50% when the action space is remapped. Advising with HRL helps to teach a teammate with a heterogeneous action space, and HMAT achieves the best performance regardless of the action rotation.

7 Conclusion

The paper presents HMAT, which utilizes deep function approximations and HRL to transfer knowledge between agents in cooperative MARL problems. We propose a method to overcome the teacher credit assignment issue and show accelerated learning progress in challenging domains. Ongoing work is expanding the framework to more than two agents.

Acknowledgements

This work was supported by IBM (as part of the MIT-IBM Watson AI Lab initiative) and AWS Machine Learning Research Awards program. Dong-Ki Kim was also supported by Kwanjeong Educational Foundation Fellowship.

References

  • [1] O. Amir, E. Kamar, A. Kolobov, and B. J. Grosz. Interactive teaching strategies for agent training. International Joint Conferences on Artificial Intelligence (IJCAI), 2016.
  • [2] P. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Association for the Advancement of Artificial Intelligence (AAAI), pages 1726–1734, 2017.
  • [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
  • [4] J. A. Clouse. On integrating apprentice learning and reinforcement learning. 1996.
  • [5] F. L. da Silva, R. Glatt, and A. H. R. Costa. Simultaneously learning and advising in multiagent reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1100–1108, 2017.
  • [6] M. Daswani, P. Sunehag, and M. Hutter. Reinforcement learning with value advice. In D. Phung and H. Li, editors, Proceedings of the Sixth Asian Conference on Machine Learning, volume 39 of Proceedings of Machine Learning Research, pages 299–314. PMLR, 26–28 Nov 2015.
  • [7] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.
  • [8] T. G. Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. J. Artif. Int. Res., 13(1):227–303, Nov. 2000.
  • [9] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.
  • [10] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596, 2018.
  • [11] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • [12] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. CoRR, abs/1704.03003, 2017.
  • [13] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • [14] T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3682–3690, 2016.
  • [15] H. M. Le, N. Jiang, A. Agarwal, M. Dudík, Y. Yue, and H. Daumé III. Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590, 2018.
  • [16] H. M. Le, Y. Yue, P. Carr, and P. Lucey. Coordinated multi-agent imitation learning. In International Conference on Machine Learning, pages 1995–2003, 2017.
  • [17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
  • [18] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6382–6393, 2017.
  • [19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1928–1937. PMLR, 2016.
  • [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, pages 529–533, 2015.
  • [21] O. Nachum, S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. CoRR, abs/1805.08296, 2018.
  • [22] F. A. Oliehoek and C. Amato. A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems. Springer, May 2016.
  • [23] S. Omidshafiei, D. Kim, M. Liu, G. Tesauro, M. Riemer, C. Amato, M. Campbell, and J. P. How. Learning to teach in cooperative multiagent reinforcement learning. CoRR, abs/1805.07830, 2018.
  • [24] R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, pages 1043–1049, 1998.
  • [25] M. Riemer, M. Liu, and G. Tesauro. Learning abstract options. In Advances in Neural Information Processing Systems, pages 10445–10455, 2018.
  • [26] E. M. Rogers. Diffusion of innovations. Simon and Schuster, 2010.
  • [27] S. Ross and J. A. Bagnell. Reinforcement and imitation learning via interactive no-regret learning. CoRR, abs/1406.5979, 2014.
  • [28] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15, pages 627–635. PMLR, 2011.
  • [29] R. S. Sutton, D. Precup, and S. Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181 – 211, 1999.
  • [30] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res., 10:1633–1685, Dec. 2009.
  • [31] L. Torrey and M. Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pages 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
  • [32] Y. Tsvetkov, M. Faruqui, W. Ling, B. MacWhinney, and C. Dyer. Learning the curriculum with bayesian optimization for task-specific word representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 130–139. Association for Computational Linguistics, 2016.
  • [33] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3540–3549. PMLR, 2017.
  • [34] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
  • [35] T. Xu, Q. Liu, L. Zhao, and J. Peng. Learning to explore with meta-policy gradient. CoRR, abs/1803.05044, 2018.

Appendix A HMAT Pseudocode

Algorithm 1 HMAT Pseudocode

Require: Maximum number of episodes in session $S$
Require: Teacher update frequency $f_{\text{teacher}}$
 1: Initialize advice-level policies $\tilde{\boldsymbol{\pi}}$ and memories $\tilde{\boldsymbol{D}}$
 2: for each teaching session do
 3:     Re-initialize task-level policy parameters $\boldsymbol{\pi}$
 4:     Re-initialize task-level memories $\boldsymbol{D}$
 5:     Re-initialize train episode count: $e = 0$
 6:     while $e \le S$ do
            [Phase I: lines 7–8]
 7:         $\boldsymbol{E}^{\text{advice}}_{0:T}, \tilde{\boldsymbol{o}}_{0:T}, \tilde{\boldsymbol{g}}_{0:T} \leftarrow$ Teacher’s advice
 8:         Update episode count: $e \leftarrow e + 1$
            [Phase II: lines 9–13]
 9:         Copy temporary task-level policies: $\boldsymbol{\pi}_{\text{temp}} \leftarrow \boldsymbol{\pi}$
10:         Update $\boldsymbol{\pi}_{\text{temp}}$ using Eqns. (1)–(2) with $\boldsymbol{E}^{\text{advice}}_{0:T}$
11:         $\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}} \leftarrow$ $\boldsymbol{\pi}_{\text{temp}}$ perform self-practice
12:         Update episode count: $e \leftarrow e + \bar{T}/T$
13:         $\tilde{r} \leftarrow$ Get teacher reward with $\tilde{\mathcal{R}}$
            [Phase III: lines 14–19]
14:         Add $\boldsymbol{E}^{\text{advice}}_{0:T}$ and $\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}}$ to $\boldsymbol{D}$
15:         Add a teacher experience $\tilde{\boldsymbol{E}}_{0:T}$ to $\tilde{\boldsymbol{D}}$
16:         Update $\boldsymbol{\pi}$ using Eqns. (1)–(2) with $\boldsymbol{D}$
17:         if $e \bmod f_{\text{teacher}} = 0$ then
18:             Update $\tilde{\boldsymbol{\pi}}$ using Eqns. (3)–(4) with $\tilde{\boldsymbol{D}}$
19:         end if
20:     end while
21: end for
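The episode bookkeeping in Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative skeleton, not the authors' implementation: the callables `advise`, `self_practice`, and `teacher_reward` are hypothetical stand-ins, the actual policy updates are replaced by counters, and the loop condition `e <= max_episodes` is an assumption about the comparison used in the original algorithm.

```python
from typing import Callable, Dict, List

Batch = List[float]  # placeholder: a batch is just a list of rewards here


def run_session(
    max_episodes: int,         # S
    teacher_update_freq: int,  # f_teacher
    steps_per_episode: int,    # T
    self_practice_steps: int,  # T-bar
    advise: Callable[[], Batch],
    self_practice: Callable[[Batch], Batch],
    teacher_reward: Callable[[Batch], float],
) -> Dict[str, int]:
    """Walk through one teaching session and count what happens."""
    e = 0
    iterations = 0
    teacher_updates = 0
    teacher_rewards: List[float] = []
    while e <= max_episodes:  # comparison operator is an assumption
        # Phase I: one episode in which the teacher may advise.
        advice_batch = advise()
        e += 1
        # Phase II: evaluate the advice with self-practice rollouts.
        practice_batch = self_practice(advice_batch)
        e += self_practice_steps // steps_per_episode
        teacher_rewards.append(teacher_reward(practice_batch))
        # Phase III: task-level and teacher updates (only counted here).
        if e % teacher_update_freq == 0:
            teacher_updates += 1
        iterations += 1
    return {
        "episodes": e,
        "iterations": iterations,
        "teacher_updates": teacher_updates,
        "rewards_collected": len(teacher_rewards),
    }


# One-box push settings from Appendix C.2, with toy stand-in callables.
stats = run_session(
    max_episodes=600, teacher_update_freq=15,
    steps_per_episode=50, self_practice_steps=100,
    advise=lambda: [-1.0] * 50,
    self_practice=lambda batch: [-0.5] * 100,
    teacher_reward=lambda batch: sum(batch),
)
print(stats)
```

The sketch makes one accounting detail explicit: each loop iteration consumes one advising episode plus $\bar{T}/T$ self-practice episodes, which is why HMAT counts both Phase I and Phase II episodes toward the session budget.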

Appendix B Additional Details of Teacher Policy

B.1 Teacher Observation

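While agents can simultaneously advise one another, for clarity, we detail the teacher observation when agents i and j are the teacher and student agent, respectively. For agent i’s teacher policy, its observation $\tilde{o}^i_t$ consists of:

$$\tilde{o}^i_t = \Big\{ \underbrace{\boldsymbol{o}_t,\; g^i_t,\; g^{ij}_t,\; Q^i(\boldsymbol{o}_t, g^i_t, g^j_t),\; Q^i(\boldsymbol{o}_t, g^i_t, g^{ij}_t)}_{\text{Teacher Knowledge}},\; \underbrace{g^j_t,\; Q^j(\boldsymbol{o}_t, g^i_t, g^j_t),\; Q^j(\boldsymbol{o}_t, g^i_t, g^{ij}_t)}_{\text{Student Knowledge}},\; \underbrace{R_{\text{Phase I}},\; R_{\text{Phase II}},\; t_{\text{Remain}}}_{\text{Misc.}} \Big\},$$

where $\boldsymbol{o}_t = \{o^i_t, o^j_t\}$; $g^i_t \sim \pi^i(o^i_t)$; $g^j_t \sim \pi^j(o^j_t)$; $g^{ij}_t \sim \pi^i(o^j_t)$; $Q^i$, $Q^j$ are the centralized critics for agent i and j, respectively; $R_{\text{Phase I}}$ and $R_{\text{Phase II}}$ are average rewards in the last few iterations in Phase I and Phase II, respectively; and $t_{\text{Remain}}$ is the remaining time in current session.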
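As a rough illustration of how the teacher observation above could be assembled into a flat network input, the following sketch groups the terms into a dataclass. The class name, field names, and the flattening order are assumptions made for this example; the paper specifies only which quantities are included.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeacherObservation:
    """Hypothetical container for agent i's teacher observation."""
    obs: List[float]     # joint observation o_t = {o^i_t, o^j_t}
    g_i: List[float]     # teacher's own sub-goal g^i_t
    g_ij: List[float]    # sub-goal the teacher would pick at the student's observation
    q_i_own: float       # Q^i(o_t, g^i_t, g^j_t)
    q_i_advised: float   # Q^i(o_t, g^i_t, g^ij_t)
    g_j: List[float]     # student's intended sub-goal g^j_t
    q_j_own: float       # Q^j(o_t, g^i_t, g^j_t)
    q_j_advised: float   # Q^j(o_t, g^i_t, g^ij_t)
    r_phase1: float      # recent average reward in Phase I
    r_phase2: float      # recent average reward in Phase II
    t_remain: float      # remaining time in the current session

    def flatten(self) -> List[float]:
        """Concatenate all fields in the order used in the definition above."""
        return [
            *self.obs, *self.g_i, *self.g_ij, self.q_i_own, self.q_i_advised,
            *self.g_j, self.q_j_own, self.q_j_advised,
            self.r_phase1, self.r_phase2, self.t_remain,
        ]


example = TeacherObservation(
    obs=[0.1, 0.2, -0.3, 0.4], g_i=[0.5, 0.0], g_ij=[-0.5, 0.0],
    q_i_own=-3.0, q_i_advised=-2.5, g_j=[0.2, 0.1],
    q_j_own=-4.0, q_j_advised=-3.0,
    r_phase1=-12.0, r_phase2=-11.0, t_remain=0.75,
)
print(len(example.flatten()))
```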

B.2 Teacher Reward Function

Recall that in the advice evaluation phase of HMAT (Phase II), the batch of self-practice experiences $\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}}$ reflects how the agents perform by themselves after one advising phase. The next important question is how to choose a teacher reward function $\tilde{\mathcal{R}}$ that transforms the self-practice experiences into a measure of learning performance. In this section, we detail our choices of teacher reward functions, including those of [23] and [35], as summarized in Table 2.

Table 2: Summary of teacher reward functions $\tilde{\mathcal{R}}$. Rollout reward $\hat{R}$ denotes the sum of rewards in the self-practice experiences $\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}}$.
Teacher Reward Name           Description                                               Teacher Reward $\tilde{r}^i$
VEG: Value Estimation Gain    Student’s Q-value above threshold $\tau$ [23]             $\mathbb{1}(Q_{\text{student}} > \tau)$
DR: Difference Rollout        Difference in rollout reward before/after advising [35]   $\hat{R} - \hat{R}_{\text{before}}$
CR: Current Rollout           Rollout reward after advising phase                       $\hat{R}$

Value Estimation Gain (VEG)   The value estimation gain (VEG) teacher reward function was introduced in the Learning to Coordinate and Teach Reinforcement (LeCTR) framework [23], where it performed the best. VEG rewards a teacher when its student’s Q-value exceeds a threshold $\tau$: $\mathbb{1}(Q_{\text{student}} > \tau)$. The motivation for VEG is that the student’s Q-value is correlated with its estimated learning progress [23]. Algorithm 2 describes how teachers receive the VEG teacher reward. Note that the Q-functions in the algorithm are those of the updated temporary task-level policies $\boldsymbol{\pi}_{\text{temp}}$. Threshold values of -6.0 and -20.0 are used for the one-box and two-box push domains, respectively.

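Algorithm 2 VEG Reward Pseudocode

Require: Self-practice experiences $\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}}$
Require: Threshold $\tau$
 1: Initialize agent i’s teacher reward $\tilde{r}^i = 0$
 2: Initialize agent j’s teacher reward $\tilde{r}^j = 0$
 3: for $\langle \boldsymbol{o}, \boldsymbol{g}, \boldsymbol{r}, \boldsymbol{o}' \rangle \in \boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}}$ do
 4:     if $Q^j(\boldsymbol{o}, \boldsymbol{g}) > \tau$ then
 5:         $\tilde{r}^i \leftarrow \tilde{r}^i + 1$
 6:     end if
 7:     if $Q^i(\boldsymbol{o}, \boldsymbol{g}) > \tau$ then
 8:         $\tilde{r}^j \leftarrow \tilde{r}^j + 1$
 9:     end if
10: end for
11: $\tilde{r} = \tilde{r}^i + \tilde{r}^j$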
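A direct Python transcription of Algorithm 2 is given below as a minimal sketch. The critics `toy_q_i` and `toy_q_j` and the three-element batch are invented purely so the example runs; in HMAT the Q-functions would be the centralized critics of the temporary task-level policies.

```python
from typing import Callable, Sequence, Tuple

Experience = Tuple[tuple, tuple, float, tuple]  # (o, g, r, o')
QFunction = Callable[[tuple, tuple], float]


def veg_reward(
    experiences: Sequence[Experience],
    q_i: QFunction,
    q_j: QFunction,
    tau: float,
) -> int:
    """Count how often each agent's Q-value exceeds the threshold tau."""
    r_i = 0  # teacher reward credited to agent i
    r_j = 0  # teacher reward credited to agent j
    for obs, goals, _reward, _next_obs in experiences:
        if q_j(obs, goals) > tau:  # student j looks good, so teacher i is rewarded
            r_i += 1
        if q_i(obs, goals) > tau:  # student i looks good, so teacher j is rewarded
            r_j += 1
    return r_i + r_j


# Toy critics (hypothetical): values depend only on the first observation entry.
toy_q_i = lambda obs, goals: -obs[0]
toy_q_j = lambda obs, goals: -2.0 * obs[0]
batch = [
    ((1.0,), (0.0,), -1.0, (0.9,)),
    ((4.0,), (0.0,), -4.0, (3.9,)),
    ((8.0,), (0.0,), -8.0, (7.9,)),
]
print(veg_reward(batch, toy_q_i, toy_q_j, tau=-6.0))
```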

Difference Rollout (DR)   The difference rollout (DR) teacher reward function is used in the work of Xu et al. [35] for learning an exploration policy for a single agent. DR requires an additional batch of self-practice experiences collected before the advising phase, $\boldsymbol{E}^{\text{self-prac., before}}_{0:\bar{T}}$. The teacher reward is then the difference between the sum of rewards in $\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}}$ (denoted by $\hat{R}$) and the sum of rewards in $\boldsymbol{E}^{\text{self-prac., before}}_{0:\bar{T}}$ (denoted by $\hat{R}_{\text{before}}$).

Current Rollout (CR)   The current rollout (CR) is a simpler reward function than DR. CR returns the sum of rewards in the self-practice experiences $\boldsymbol{E}^{\text{self-prac.}}_{0:\bar{T}}$.
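The two rollout-based rewards reduce to sums over the reward entries of the self-practice experiences. The sketch below assumes each experience is a tuple whose third element is the team reward; the function names and the toy batches are hypothetical.

```python
from typing import Sequence, Tuple

Experience = Tuple[tuple, tuple, float, tuple]  # (o, g, r, o')


def rollout_reward(experiences: Sequence[Experience]) -> float:
    """R-hat: sum of rewards in a batch of self-practice experiences."""
    return sum(reward for (_obs, _goal, reward, _next_obs) in experiences)


def cr_reward(after: Sequence[Experience]) -> float:
    """Current Rollout: rollout reward after the advising phase."""
    return rollout_reward(after)


def dr_reward(after: Sequence[Experience], before: Sequence[Experience]) -> float:
    """Difference Rollout: rollout reward after advising minus before advising."""
    return rollout_reward(after) - rollout_reward(before)


# Toy batches: rewards improve after advising.
before = [((0.0,), (0.0,), -2.0, (0.0,)), ((0.0,), (0.0,), -2.0, (0.0,))]
after = [((0.0,), (0.0,), -1.0, (0.0,)), ((0.0,), (0.0,), -0.5, (0.0,))]
print(cr_reward(after), dr_reward(after, before))
```

Note that DR needs a second set of rollouts collected before advising, so it costs more environment interaction per teacher reward than CR.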

Performance Comparisons   Our empirical experiments in Table 3 show that CR performs the best with HMAT. Consistent with this observation, the learning progress estimated with CR has a very high correlation with the true learning progress (see Appendix E).

B.3 Teacher Experience

In this section, we detail how the next teacher observation $\tilde{o}'^{\,i}_{0:T}$ in the teacher experience $\tilde{E}^i_{0:T} = \langle \tilde{o}^i_{0:T}, \tilde{g}^i_{0:T}, \tilde{r}, \tilde{o}'^{\,i}_{0:T} \rangle$ is obtained (focusing on teacher i for clarity). To get the next teacher observation, the teacher’s observation $\tilde{o}^i_{0:T}$ is updated by replacing student j’s Q-values (see Appendix B.1) with those computed under student j’s updated temporary policy $\pi^j_{\text{temp}}$ (i.e., representing the change in student j’s knowledge due to the advice $\tilde{g}^i_{0:T}$). $R_{\text{Phase I}}$, $R_{\text{Phase II}}$, and $t_{\text{Remain}}$ are also updated in the next teacher observation.

Table 3: $\bar{V}$ and AUC for different reward functions. Results show a mean and standard deviation computed for 10 sessions. Best results are marked with * (computed via a t-test with p<0.05).
                                              One Box Push                    Two Box Push
Algorithm         Hierarchical?  Teaching?    $\bar{V}$         AUC           $\bar{V}$         AUC
HMAT (with VEG)   Yes            Yes          -10.38 ± 0.25     424 ± 28      -29.73 ± 2.89     4694 ± 366
HMAT (with DR)    Yes            Yes          -10.36 ± 0.32     417 ± 31      -28.12 ± 1.58*    4758 ± 286
HMAT (with CR)    Yes            Yes          -10.10 ± 0.19*    458 ± 8*      -27.49 ± 0.96*    5032 ± 186*

Appendix C Additional Experimental/Domain Details

C.1 Implementation Details

Each policy’s actor and critic are two-layer feed-forward neural networks with rectified linear unit (ReLU) activations. For the task-level actor policies, a final tanh activation bounds the outputs between -1 and 1. Actors for the worker and primitive task-level policies output two actions that correspond to x–y forces to move in the one-box and two-box domains. Similarly, actors for the manager task-level policies output two actions, but these correspond to a sub-goal given as an (x, y) coordinate.

HMAT’s teacher-level actor policies output four actions, where the first two outputs correspond to what to advise (i.e., continuous sub-goal advice given as an (x, y) coordinate) and the last two outputs correspond to when to advise (i.e., discrete actions converted to a one-hot encoding). Two separate final layers are used: a tanh activation for the what-to-advise actions and a linear activation for the when-to-advise actions. Note that the Gumbel-Softmax estimator [13] is used to compute gradients for the discrete when-to-advise actions.
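The forward pass of such a four-output head can be sketched with the standard library alone. This is an illustrative assumption-laden example, not the paper's network: the function names, the temperature value, and the way the relaxed sample is converted to a one-hot vector via argmax are choices made here, and the gradient computation that motivates Gumbel-Softmax is not shown.

```python
import math
import random
from typing import List, Tuple


def gumbel_softmax(logits: List[float], temperature: float,
                   rng: random.Random) -> List[float]:
    """Draw a relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)         # avoid log(0)
        gumbel = -math.log(-math.log(u))     # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                        # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [v / total for v in exps]


def teacher_head(raw: List[float], temperature: float,
                 rng: random.Random) -> Tuple[List[float], List[int], List[float]]:
    """Split four raw outputs into sub-goal advice and an advise/no-advise choice."""
    what = [math.tanh(v) for v in raw[:2]]   # continuous (x, y) sub-goal in [-1, 1]
    relaxed = gumbel_softmax(raw[2:], temperature, rng)
    best = relaxed.index(max(relaxed))
    when = [1 if k == best else 0 for k in range(len(relaxed))]  # one-hot decision
    return what, when, relaxed


what, when, relaxed = teacher_head(
    [0.4, -1.2, 2.0, 0.5], temperature=1.0, rng=random.Random(0)
)
print(what, when, relaxed)
```

Which of the two discrete outputs means "advise" is not specified in the text, so the sketch leaves the one-hot vector uninterpreted.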

Because our focus is teaching at the manager level, not at the worker level, we pre-train the worker policies with randomly generated sub-goals and then fix them. Per Section 2.2, the intrinsic reward function used to pre-train the worker policy is the negative squared distance between the current position and the sub-goal: $r^{\text{intrinsic}}_t = -\|o_t - g_t\|_2^2$, where $o_t$ and $g_t$ correspond to the current position and sub-goal, respectively. All hierarchical methods presented in this work use pre-trained workers.
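For concreteness, the worker's intrinsic reward can be written as a one-line function. The name `intrinsic_reward` and the use of plain tuples for positions are assumptions for this sketch.

```python
from typing import Sequence


def intrinsic_reward(position: Sequence[float], sub_goal: Sequence[float]) -> float:
    """Negative squared Euclidean distance between position and sub-goal."""
    return -sum((p - g) ** 2 for p, g in zip(position, sub_goal))


print(intrinsic_reward((0.0, 0.0), (3.0, 4.0)))  # squared distance is 9 + 16 = 25
```

The reward is zero exactly when the worker reaches the sub-goal and grows more negative with distance, which gives a dense training signal for the randomly generated sub-goals.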

Amazon Elastic Compute Cloud (EC2) C5 and P3 instances are used as the computing infrastructure.

C.2 Cooperative One-Box Push Domain

  • Each agent’s observation includes its position/speed, the position of the box, the position of the left target, and the other agent’s position.

  • Every episode, the box and the left target are initialized at (-0.25,0.00) and (-0.85,0.00), respectively. Agents are initialized at random locations.

  • The domain returns a team reward $r_t = -\|\text{loc(left target)} - \text{loc(box)}\|_2^2$.

  • The box and the agents have a radius of 0.25 and 0.10, respectively.

  • The width and height of the domain are from -1 to 1.

  • Maximum timestep per episode T is 50 timesteps.

  • All methods use discount factor of γ=0.99.

  • The maximum number of episodes in a session S is 600 episodes.

  • Adam optimizer; actor learning rate of 0.0001 and critic learning rate of 0.001.

  • As in TD3 [10], task-level policies for all methods use purely random exploration for the first 2000 timesteps, followed by exploration with added Gaussian noise. The added Gaussian noise has a mean of 0 and a standard deviation of 0.10. Empirically, adding the purely random exploration leads to more stable results.

  • Managers in all hierarchical methods generate a sub-goal every H=5 timesteps.

  • The self-practice rollout length $\bar{T}$ is 100 timesteps (two episodes).

  • Teacher update frequency $f_{\text{teacher}}$ of {15,30,45} episodes.

  • Task-level batch size of 50; teacher-level batch size of {16,32,64}.

  • We focus on knowledge transfer from agent i to agent k in this domain. Recall that agent i already understands how to move the box to the left side. Therefore, we fix the weights of agent i’s task-level policy, but train agent k’s task-level policy throughout the experiments.
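The exploration schedule listed above (purely random actions first, then Gaussian noise added to the policy output) can be sketched as follows. The function name, the uniform sampling range of [-1, 1], and the final clipping to that range are assumptions of this example; the text specifies only the 2000-timestep warm-up and the noise parameters.

```python
import random
from typing import List


def explore_action(
    policy_action: List[float],
    timestep: int,
    rng: random.Random,
    random_steps: int = 2000,   # purely random exploration period
    noise_std: float = 0.10,    # standard deviation of the added Gaussian noise
) -> List[float]:
    """Return the action actually executed at a given global timestep."""
    if timestep < random_steps:
        # Warm-up: ignore the policy and act uniformly at random (assumed range).
        return [rng.uniform(-1.0, 1.0) for _ in policy_action]
    noisy = [a + rng.gauss(0.0, noise_std) for a in policy_action]
    # Clipping keeps actions in the tanh output range (an assumption here).
    return [max(-1.0, min(1.0, a)) for a in noisy]


rng = random.Random(0)
warmup_action = explore_action([0.5, -0.5], timestep=0, rng=rng)
later_action = explore_action([0.5, -0.5], timestep=5000, rng=rng)
print(warmup_action, later_action)
```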

C.3 Cooperative Two-Box Push Domain

  • Each agent’s observation includes its position/speed, the positions of the boxes, the positions of the targets, and the other agent’s position.

  • Every episode, the two boxes are initialized at (-0.30,0.00) and (0.30,0.00), respectively. The two targets are initialized at (-0.85,0.00) and (0.85,0.00), respectively. Agents are reset at random locations.

  • The boxes have a radius of 0.25. The agents have a radius of 0.10.

  • The domain returns a team reward
    $r_t = -\{\|\text{loc(left target)} - \text{loc(left box)}\|_2^2 + \|\text{loc(right target)} - \text{loc(right box)}\|_2^2\}$.

  • The left box has a mass of 2 and the right box has a mass of 6.

  • The width and height of the domain are from -1 to 1.

  • Maximum timestep per episode T is 100 timesteps.

  • All methods use discount factor of γ=0.99.

  • The maximum number of episodes in a session S is 1800 episodes.

  • Adam optimizer; actor learning rate of 0.0001 and critic learning rate of 0.001.

  • Task-level policies for all methods use purely random exploration for the first 2000 timesteps, followed by exploration with added Gaussian noise. The added Gaussian noise has a mean of 0 and a standard deviation of 0.10.

  • Managers in all hierarchical methods generate a sub-goal every H=5 timesteps.

  • The self-practice rollout length $\bar{T}$ is 200 timesteps (two episodes).

  • Teacher update frequency $f_{\text{teacher}}$ of {15,30,45} episodes.

  • Task-level batch size 50; Teacher-level batch size of {16,32,64}.

  • Both agent i and agent k’s task-level policies are trained.
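As a worked example of the two-box team reward defined in the list above, the sketch below evaluates it at the initial box and target positions. The function names are hypothetical; only the reward formula and the coordinates come from the text.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two 2-D locations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_box_team_reward(
    left_box: Sequence[float],
    right_box: Sequence[float],
    left_target: Sequence[float] = (-0.85, 0.00),
    right_target: Sequence[float] = (0.85, 0.00),
) -> float:
    """Negative sum of squared box-to-target distances for both boxes."""
    return -(squared_distance(left_target, left_box)
             + squared_distance(right_target, right_box))


# Reward at the start of an episode, before either box has moved.
initial_reward = two_box_team_reward(left_box=(-0.30, 0.00), right_box=(0.30, 0.00))
print(initial_reward)
```

Each box starts 0.55 from its target, so the initial reward is about -(0.55² + 0.55²) = -0.605, and it approaches zero only when both boxes reach their targets.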

Appendix D Baseline Details

D.1 LeCTR Baseline Details

For the original LeCTR framework (LeCTR–Tile), various parameters related to the tile-coding task-level policies are evaluated. Specifically, we try combinations of: hash size values of {1024,4096,16384}; number of tilings of {16,32,64,256}; number of tiles per tiling of {5,10,20}; and $\gamma$ of {0.90,0.95,0.99}.

D.2 Heuristic-Based Teaching Baseline Details

Heuristic-based teaching methods [1, 4, 5, 31] use various heuristics to decide when to advise. Once it decides to advise, a teacher agent advises with its best action at the student’s state/observation. We compare the Ask Important (AI) heuristic [1] and the Ask Important–Correct Important (AICI) heuristic [1] in our experiments. We also consider AI and AICI applied in the hierarchical setting: hierarchical AI (HAI) and hierarchical AICI (HAICI). Because heuristic-based teaching approaches were designed for tile-coding policies and discrete action spaces, we apply minor modifications to combine them with MATD3 or HMATD3. In this section, we detail our modifications to AI, AICI, HAI, and HAICI.

AI and HAI   In AI, a student agent j asks for advice when the importance of a state, $I_{\text{student}}$, is larger than a threshold $\tau_{\text{student}}$: $I_{\text{student}} = \max_a Q_{\text{student}}(o^j, a) - \min_a Q_{\text{student}}(o^j, a) > \tau_{\text{student}}$. A large $I_{\text{student}}$ indicates an important state in which receiving advice can yield significantly better performance, as determined by the student agent’s Q-function. We make the following modifications to measure $I_{\text{student}}$ with MATD3. First, we uniformly sample 20 actions to approximate the set of all possible actions. Second, $I_{\text{student}}$ is modified to use the centralized critics. Specifically, considering the teacher agent i and the student agent j for clarity, the student j asks for advice when $I_{\text{student}} = \max_{a \in \hat{A}} Q^j(o^i, o^j, a^i, a) - \min_{a \in \hat{A}} Q^j(o^i, o^j, a^i, a) > \tau_{\text{student}}$, where $\hat{A}$ denotes the set of 20 sampled actions and $a^i \sim \pi^i(o^i)$. Several $\tau_{\text{student}}$ values of {0.3,0.6,0.9} are evaluated, and AI performs the best with the value of 0.6.
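The sampled-action version of the importance test can be sketched as below. The critic is passed in as a function of the student's action only, on the assumption that the observations and the teammate's action have already been bound into it; `toy_q`, `flat_q`, and the helper names are invented for this illustration.

```python
import random
from typing import Callable, List

Action = List[float]


def sample_actions(n: int, dim: int, rng: random.Random) -> List[Action]:
    """Uniformly sample n candidate actions from [-1, 1]^dim."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


def state_importance(q_of_action: Callable[[Action], float],
                     actions: List[Action]) -> float:
    """I = max_a Q(a) - min_a Q(a) over the sampled action set."""
    values = [q_of_action(a) for a in actions]
    return max(values) - min(values)


def student_should_ask(q_of_action: Callable[[Action], float],
                       actions: List[Action],
                       tau_student: float = 0.6) -> bool:
    """Ask for advice when the state importance exceeds the threshold."""
    return state_importance(q_of_action, actions) > tau_student


candidates = sample_actions(20, 2, random.Random(0))
# Hypothetical critics: one sensitive to the action, one completely flat.
toy_q = lambda a: -(a[0] - 0.5) ** 2 - (a[1] + 0.5) ** 2
flat_q = lambda a: 0.0
print(state_importance(toy_q, candidates), student_should_ask(flat_q, candidates))
```

A flat critic never triggers a request, since all actions look equally good; the threshold therefore filters out states where advice could not change much.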

HAI is similar to AI, except that it operates in the hierarchical setting. Specifically, in HAI, a teacher agent gives sub-goal advice instead of primitive action advice, and larger $\tau_{\text{student}}$ values of {1.5,3.0,4.5} are evaluated to account for the manager’s sub-goal generation frequency of every H=5 timesteps. HAI performs the best with the value of 3.0.

AICI and HAICI   As in AI, students in AICI ask for advice when $I_{\text{student}}$ is larger than a threshold $\tau_{\text{student}}$. When asked, however, teachers can decide whether or not to give advice based on $I_{\text{teacher}}$ and the difference between the student’s intended action and the teacher’s intended action at the student state. Specifically, when asked, teacher agent i decides to advise student j if $I_{\text{teacher}} = \max_a Q_{\text{teacher}}(o^j, a) - \min_a Q_{\text{teacher}}(o^j, a) > \tau_{\text{teacher}}$ and $a^j \neq a^{ij}$, where $a^j \sim \pi^j(o^j)$ and $a^{ij} \sim \pi^i(o^j)$. Similar to AI, minor modifications are made for AICI to advise MATD3 task-level policies. Student j asks for advice as in AI, but teacher i in AICI decides to advise when $I_{\text{teacher}} = \max_{a \in \hat{A}} Q^i(o^i, o^j, a^i, a) - \min_{a \in \hat{A}} Q^i(o^i, o^j, a^i, a) > \tau_{\text{teacher}}$ and $\|a^j - a^{ij}\|_2^2 > \tau_{\text{action}}$, where $\hat{A}$ denotes the set of 20 sampled actions. Combinations of $\tau_{\text{student}}$ values of {0.3,0.6,0.9} and $\tau_{\text{teacher}}$ values of {0.3,0.6,0.9} are evaluated with $\tau_{\text{action}}$ of 0.3. AICI performs the best with a $\tau_{\text{student}}$ value of 0.3 and a $\tau_{\text{teacher}}$ value of 0.6.
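The teacher-side test in the modified AICI combines two conditions, which the following sketch makes explicit. The function name and argument layout are assumptions; the teacher's Q-values over the sampled actions are taken as precomputed inputs.

```python
from typing import Sequence


def teacher_should_advise(
    teacher_q_values: Sequence[float],  # Q^i evaluated on the sampled action set
    student_action: Sequence[float],    # a^j, the student's intended action
    teacher_action: Sequence[float],    # a^ij, the teacher's action at the student's state
    tau_teacher: float = 0.6,
    tau_action: float = 0.3,
) -> bool:
    """Advise only in important states where the two intended actions differ enough."""
    importance = max(teacher_q_values) - min(teacher_q_values)
    action_gap = sum((s - t) ** 2 for s, t in zip(student_action, teacher_action))
    return importance > tau_teacher and action_gap > tau_action


# Important state, clearly different actions: advise.
print(teacher_should_advise([0.0, -1.0], [0.0, 0.0], [1.0, 0.0]))
# Important state, nearly identical actions: stay silent.
print(teacher_should_advise([0.0, -1.0], [0.0, 0.0], [0.1, 0.0]))
```

Replacing the discrete inequality $a^j \neq a^{ij}$ with a squared-distance threshold is what makes the heuristic usable with continuous actions, where two sampled actions are almost never exactly equal.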

Similarly for HAICI, combinations of $\tau_{\text{student}}$ values of {1.5,3.0,4.5} and $\tau_{\text{teacher}}$ values of {1.5,3.0,4.5} are evaluated with $\tau_{\text{action}}$ of 0.3. HAICI performs the best with a $\tau_{\text{student}}$ value of 1.5 and a $\tau_{\text{teacher}}$ value of 3.0.

Figure 4: Ground-truth learning progress vs estimated learning progress (with CR). The CR teacher reward function well estimates the true learning progress with a high correlation of 0.946.
Figure 5: Teacher actor loss between different number of threads. With increasing number of threads, a teacher policy converges faster.

Appendix E Analysis of Teacher Reward Function

Computing the ground-truth learning progress of task-level policies often requires an expert policy and can be computationally expensive [12]. Thus, our framework uses an estimate of the learning progress as the teacher reward. However, it is important to understand how close this estimate is to the true learning progress. The goal of teacher policies is to maximize the cumulative teacher reward, so a poor estimate of the teacher reward would result in learning undesirable teacher policies. In this section, we measure the difference between the true and estimated learning progress and analyze the CR teacher reward function, which performed the best.

In imitation learning, under the assumption that an expert is given, one way to measure the true learning progress is to measure the distance between the action of a learning agent and the optimal action of the expert [27, 6]. Similarly, we pre-train expert policies (HMATD3) and measure the true learning progress by the action differences. The comparison between the true learning progress and the learning progress estimated with the CR teacher reward function is shown in Figure 4. The Pearson correlation is 0.946, which empirically shows that the CR teacher reward function estimates the learning progress well.
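For reference, the Pearson correlation used in this comparison can be computed as in the minimal sketch below. The two short series are made-up placeholders, not the paper's data; in the analysis above, the inputs would be the true and CR-estimated learning progress curves.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder series (hypothetical values, for illustration only).
true_progress = [1.0, 2.0, 3.0, 4.0]
estimated_progress = [2.0, 4.0, 6.0, 8.0]
print(pearson(true_progress, estimated_progress))
```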

Appendix F Asynchronous Hierarchical Multiagent Teaching

Just as a deep RL algorithm can require millions of episodes to learn a useful policy [20], our teacher policies require many sessions to learn. Because one session consists of many episodes, considerable time may be needed until the teacher policies converge. We address this potential issue with asynchronous policy updates using multi-threading, as in asynchronous advantage actor-critic (A3C) [19]. A3C demonstrated a reduction in training time that is roughly linear in the number of threads. We also show that our HMAT variant, asynchronous HMAT, achieves a roughly linear reduction in training time as a function of the number of threads (see Figure 5).