Meta-Inverse Reinforcement Learning with Probabilistic Context Variables

  • 2019-10-26 21:15:42
  • Lantao Yu, Tianhe Yu, Chelsea Finn, Stefano Ermon


Providing a suitable reward function to reinforcement learning can be difficult in many real world applications. While inverse reinforcement learning (IRL) holds promise for automatically learning reward functions from demonstrations, several major challenges remain. First, existing IRL methods learn reward functions from scratch, requiring large numbers of demonstrations to correctly infer the reward for each task the agent may need to perform. Second, existing methods typically assume homogeneous demonstrations for a single behavior or task, while in practice, it might be easier to collect datasets of heterogeneous but related behaviors. To this end, we propose a deep latent variable model that is capable of learning rewards from demonstrations of distinct but related tasks in an unsupervised way. Critically, our model can infer rewards for new, structurally-similar tasks from a single demonstration. Our experiments on multiple continuous control tasks demonstrate the effectiveness of our approach compared to state-of-the-art imitation and inverse reinforcement learning methods.



Lantao Yu*, Tianhe Yu*, Chelsea Finn, Stefano Ermon
Department of Computer Science, Stanford University
Stanford, CA 94305
*Equal contribution.





33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

While reinforcement learning (RL) has been successfully applied to a range of decision-making and control tasks in the real world, it relies on a key assumption: having access to a well-defined reward function that measures progress towards the completion of the task. Although it can be straightforward to provide a high-level description of success conditions for a task, existing RL algorithms usually require a more informative signal to expedite exploration and learn complex behaviors in a reasonable time. While reward functions can be hand-specified, reward engineering can require significant human effort. Moreover, for many real-world tasks, it can be challenging to manually design reward functions that actually benefit RL training, and reward mis-specification can hamper autonomous learning Amodei et al. (2016).

Learning from demonstrations Schaal et al. (2003) sidesteps the reward specification problem by instead learning directly from expert demonstrations, which can be obtained through teleoperation Zhang et al. (2017) or from human experts Yu et al. (2018b). Demonstrations can often be easier to provide than rewards, as humans can complete many real-world tasks quite efficiently. The two major methodologies for learning from demonstrations are imitation learning and inverse reinforcement learning. Imitation learning is simple and often exhibits good performance Zhang et al. (2017); Ho and Ermon (2016). However, it lacks the ability to transfer learned policies to new settings where the task specification remains the same but the underlying environment dynamics change. As the reward function is often considered the most succinct, robust and transferable representation of a task (Abbeel and Ng, 2004; Fu et al., 2017), the problem of inferring reward functions from expert demonstrations, i.e. inverse RL (IRL) Ng and Russell (2000), is important to consider.

While appealing, IRL still typically relies on large amounts of high-quality expert data, and it can be prohibitively expensive to collect demonstrations that cover all kinds of variations in the wild (e.g. opening all kinds of doors or navigating to all possible target positions). As a result, these methods are data-inefficient, particularly when learning rewards for individual tasks in isolation, starting from scratch. On the other hand, meta-learning Schmidhuber (1987); Bengio et al. (1991), also known as learning to learn, seeks to exploit the structural similarity among a distribution of tasks and optimizes for rapid adaptation to unknown settings with a limited amount of data. As the reward function is able to succinctly capture the structure of a reinforcement learning task, e.g. the goal to achieve, it is promising to develop methods that can quickly infer the structure of a new task, i.e. its reward, and train a policy to adapt to it. Xu et al. (2018) and Gleave and Habryka (2018) have proposed approaches that combine IRL and gradient-based meta-learning Finn et al. (2017a), which provide promising results on deriving generalizable reward functions. However, they have been limited to tabular MDPs Xu et al. (2018) or settings with provided task distributions Gleave and Habryka (2018), which are challenging to gather in real-world applications.

The primary contribution of this paper is a new framework, termed Probabilistic Embeddings for Meta-Inverse Reinforcement Learning (PEMIRL), which enables meta-learning of rewards from unstructured multi-task demonstrations. In particular, PEMIRL integrates ideas from context-based meta-learning Duan et al. (2016); Rakelly et al. (2019), deep latent variable generative models Kingma and Welling (2013), and maximum entropy inverse RL (Ziebart et al., 2008; Ziebart, 2010) into a unified graphical model (see Figure 4 in Appendix D) that bridges the gap between few-shot reward inference and learning from unstructured, heterogeneous demonstrations. PEMIRL can learn robust reward functions that generalize to new tasks with a single demonstration on complex domains with continuous state-action spaces, while meta-training on a set of unstructured demonstrations without specified task groupings or labeling for each demonstration. Our experimental results on various continuous control tasks including Point-Maze, Ant, Sweeper, and Sawyer Pusher demonstrate the effectiveness and scalability of our method.

2 Preliminaries

Markov Decision Process (MDP). A discrete-time finite-horizon MDP is defined by a tuple $(T,\mathcal{S},\mathcal{A},P,r,\eta)$, where $T$ is the time horizon; $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$ describes the (stochastic) transition process between states; $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is a bounded reward function; and $\eta\in\mathcal{P}(\mathcal{S})$ specifies the initial state distribution, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over the state space $\mathcal{S}$. We use $\tau$ to denote a trajectory, i.e. a sequence of state-action pairs for one episode. We also use $\rho_\pi(s_t)$ and $\rho_\pi(s_t,a_t)$ to denote the state and state-action marginal distributions encountered when executing a policy $\pi(a_t|s_t)$.

Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL). The maximum entropy reinforcement learning (MaxEnt RL) objective is defined as:

$$\max_{\pi}\ \sum_{t=1}^{T}\mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[r(s_t,a_t)+\alpha\,\mathcal{H}(\pi(\cdot|s_t))\right] \tag{1}$$

which augments the reward function with a causal entropy regularization term $\mathcal{H}(\pi)=\mathbb{E}_\pi[-\log\pi(a|s)]$. Here $\alpha$ is an optional parameter that controls the relative importance of reward and entropy. For notational simplicity, and without loss of generality, we assume $\alpha=1$ in the following. Given an expert policy $\pi_E$ obtained by the above MaxEnt RL procedure, the MaxEnt IRL framework (Ziebart et al., 2008) aims to find a reward function that rationalizes the expert behaviors, which can be interpreted as solving the following maximum likelihood estimation (MLE) problem:

$$p_\theta(\tau)\propto\left[\eta(s_1)\prod_{t=1}^{T}P(s_{t+1}|s_t,a_t)\right]\exp\left(\sum_{t=1}^{T}r_\theta(s_t,a_t)\right)=\bar{p}_\theta(\tau) \tag{2}$$

Here, $\theta$ is the parameter of the reward function and $Z_\theta$ is the partition function, i.e. $\int\bar{p}_\theta(\tau)\,d\tau$, an integral over all possible trajectories consistent with the environment dynamics. $Z_\theta$ is intractable to compute when state-action spaces are large or continuous, or when the environment dynamics are unknown.
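To make Equation (2) concrete, the sketch below enumerates every trajectory of a tiny, hypothetical deterministic MDP (two states, two actions, horizon 3, fixed start state; none of this comes from the paper), so that the partition function $Z_\theta$ can be computed by brute force. Under these assumptions the dynamics factor equals 1 for every feasible trajectory, and the average log-likelihood of a demonstration set reduces to the mean demonstrated return minus $\log Z_\theta$, which is the usual MaxEnt IRL form of the MLE objective.

```python
import itertools
import math

# Hypothetical toy problem (not from the paper): 2 states, 2 actions, horizon 3.
T = 3
ACTIONS = (0, 1)


def step(state, action):
    """Deterministic dynamics: action 1 flips the state, action 0 keeps it."""
    return state ^ action


def reward(theta, state, action):
    """Tabular reward r_theta(s, a) = theta[s][a]."""
    return theta[state][action]


def enumerate_trajectories(start_state=0):
    """Return every length-T trajectory as a tuple of (state, action) pairs."""
    trajectories = []
    for actions in itertools.product(ACTIONS, repeat=T):
        state, pairs = start_state, []
        for action in actions:
            pairs.append((state, action))
            state = step(state, action)
        trajectories.append(tuple(pairs))
    return trajectories


def trajectory_distribution(theta):
    """Normalize exp(sum of rewards) over all trajectories; also return Z_theta."""
    trajectories = enumerate_trajectories()
    weights = [math.exp(sum(reward(theta, s, a) for s, a in traj))
               for traj in trajectories]
    partition = sum(weights)
    return trajectories, [w / partition for w in weights], partition


def avg_log_likelihood(theta, demos):
    """Mean demonstrated return minus log Z_theta."""
    _, _, partition = trajectory_distribution(theta)
    mean_return = sum(sum(reward(theta, s, a) for s, a in demo)
                      for demo in demos) / len(demos)
    return mean_return - math.log(partition)


theta = [[0.0, 1.0], [0.5, -0.5]]          # arbitrary illustrative parameters
trajs, probs, Z = trajectory_distribution(theta)
demos = [trajs[0], trajs[3]]               # pretend these two were demonstrated
print(avg_log_likelihood(theta, demos))
```

Enumeration is feasible here only because the toy space contains eight trajectories; this is the computation that becomes intractable in large or continuous spaces.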

Finn et al. (2016) and Fu et al. (2017) proposed the adversarial IRL (AIRL) framework as an efficient sampling-based approximation to MaxEnt IRL, which resembles Generative Adversarial Networks (Goodfellow et al., 2014). Specifically, in AIRL, there is a discriminator $D_\theta$ (a binary classifier) parametrized by $\theta$ and an adaptive sampler $\pi_\omega$ (a policy) parametrized by $\omega$. The discriminator takes a particular form: $D_\theta(s,a)=\exp(f_\theta(s,a))/(\exp(f_\theta(s,a))+\pi_\omega(a|s))$, where $f_\theta(s,a)$ is the learned reward function and $\pi_\omega(a|s)$ is pre-computed as an input to the discriminator. The discriminator is trained to distinguish between trajectories sampled from the expert and from the adaptive sampler, while the adaptive sampler $\pi_\omega(a|s)$ is trained to maximize $\mathbb{E}_{\rho_{\pi_\omega}}[\log D_\theta(s,a)-\log(1-D_\theta(s,a))]$, which is equivalent to maximizing the following entropy-regularized policy objective (with $f_\theta(s,a)$ serving as the reward function):

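$$\mathbb{E}_{\pi_\omega}\left[\sum_{t=1}^{T}\log D_\theta(s_t,a_t)-\log(1-D_\theta(s_t,a_t))\right]=\mathbb{E}_{\pi_\omega}\left[\sum_{t=1}^{T}f_\theta(s_t,a_t)-\log\pi_\omega(a_t|s_t)\right] \tag{3}$$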

Under certain conditions, it can be shown that the learned reward function will recover the ground-truth reward up to a constant (Theorem C.1 in Fu et al. (2017)).
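The equality in Equation (3) follows directly from the form of the discriminator: substituting $D_\theta=\exp(f_\theta)/(\exp(f_\theta)+\pi_\omega)$ gives $\log D_\theta-\log(1-D_\theta)=f_\theta-\log\pi_\omega$ at every state-action pair. A minimal numerical check in Python, using arbitrary illustrative values for $f_\theta(s,a)$ and $\pi_\omega(a|s)$:

```python
import math


def discriminator(f_value, policy_prob):
    """AIRL discriminator D = exp(f) / (exp(f) + pi(a|s))."""
    exp_f = math.exp(f_value)
    return exp_f / (exp_f + policy_prob)


def airl_policy_reward(f_value, policy_prob):
    """Per-step reward signal for the sampler: log D - log(1 - D)."""
    d = discriminator(f_value, policy_prob)
    return math.log(d) - math.log(1.0 - d)


# Illustrative numbers only: f_theta(s, a) = 0.7 and pi_omega(a|s) = 0.25.
f_value, policy_prob = 0.7, 0.25
lhs = airl_policy_reward(f_value, policy_prob)
rhs = f_value - math.log(policy_prob)   # reward plus entropy bonus
print(lhs, rhs)
```

The two printed values agree up to floating-point error, which is why training the sampler on $\log D_\theta-\log(1-D_\theta)$ is entropy-regularized RL with $f_\theta$ as the reward.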

3 Probabilistic Embeddings for Meta-Inverse Reinforcement Learning

3.1 Problem Statement

Before defining our meta-inverse reinforcement learning problem (Meta-IRL), we first define the concept of an optimal context-conditional policy.

We start by generalizing the notion of an MDP with a probabilistic context variable denoted as $m\in\mathcal{M}$, where $\mathcal{M}$ is the (discrete or continuous) value space of $m$. For example, in a navigation task, the context variables could represent different goal positions in the environment. Now, each component of the MDP has an additional dependency on the context variable $m$. For example, by slightly overloading the notation, the reward function is now defined as $r:\mathcal{S}\times\mathcal{A}\times\mathcal{M}\to\mathbb{R}$. For simplicity, the state space, action space, initial state distribution and transition dynamics are often assumed to be independent of $m$ (Duan et al., 2016; Finn et al., 2017a), which we will follow in this work. Intuitively, different values of $m$ correspond to different tasks with shared structure.

Given the above definitions, the context-conditional trajectory distribution induced by a context-conditional policy $\pi:\mathcal{S}\times\mathcal{M}\to\mathcal{P}(\mathcal{A})$ can be written as:

$$p_\pi(\tau=\{\boldsymbol{s}_{1:T},\boldsymbol{a}_{1:T}\}|m)=\eta(s_1)\prod_{t=1}^{T}\pi(a_t|s_t,m)P(s_{t+1}|s_t,a_t) \tag{4}$$

Let p(m) denote the prior distribution of the latent context variable (which is a part of the problem definition). With the conditional distribution defined above, the optimal entropy-regularized context-conditional policy is defined as:

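$$\pi^*=\arg\max_{\pi}\ \mathbb{E}_{m\sim p(m),\,(\boldsymbol{s}_{1:T},\boldsymbol{a}_{1:T})\sim p_\pi(\cdot|m)}\left[\sum_{t=1}^{T}r(s_t,a_t,m)-\log\pi(a_t|s_t,m)\right] \tag{5}$$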

Now, let us introduce the problem of Meta-IRL from heterogeneous multi-task demonstration data. Suppose there is some ground-truth reward function $r(s,a,m)$ and a corresponding expert policy $\pi_E(a_t|s_t,m)$ obtained by solving the optimization problem defined in Equation (5). Given a set of demonstrations sampled i.i.d. from the induced marginal distribution $p_{\pi_E}(\tau)=\int p(m)\,p_{\pi_E}(\tau|m)\,dm$, the goal is to meta-learn an inference model $q(m|\tau)$ and a reward function $f(s,a,m)$, such that given some new demonstration $\tau_E$ generated by sampling $m\sim p(m),\ \tau_E\sim p_{\pi_E}(\tau|m)$, with $\hat{m}$ being inferred as $\hat{m}\sim q(m|\tau_E)$, the learned reward function $f(s,a,\hat{m})$ and the ground-truth reward $r(s,a,m)$ will induce the same set of optimal policies Ng et al. (1999).

Critically, we assume no knowledge of the prior task distribution $p(m)$, the latent context variable $m$ associated with each demonstration, or the transition dynamics $P(s_{t+1}|s_t,a_t)$ during meta-training. Note that the entire supervision comes from the provided unstructured demonstrations, which means we also do not assume further interactions with the experts as in Ross et al. (2011).

3.2 Meta-IRL with Mutual Information Regularization over Context Variables

Under the framework of MaxEnt IRL, we first parametrize the context variable inference model $q_\psi(m|\tau)$ and the reward function $f_\theta(s,a,m)$ (where the input $m$ is inferred by $q_\psi$). The induced $\theta$-parametrized trajectory distribution is given by:

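$$p_\theta(\tau=\{\boldsymbol{s}_{1:T},\boldsymbol{a}_{1:T}\}|m)=\frac{1}{Z(\theta)}\left[\eta(s_1)\prod_{t=1}^{T}P(s_{t+1}|s_t,a_t)\right]\exp\left(\sum_{t=1}^{T}f_\theta(s_t,a_t,m)\right) \tag{6}$$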

where $Z(\theta)$ is the partition function, i.e., an integral over all possible trajectories. Without further constraints over $m$, directly applying AIRL to learning the reward function (by augmenting each component of AIRL with an additional context variable $m$ inferred by $q_\psi$) could simply ignore $m$, which is similar to the case of InfoGAIL (Li et al., 2017). Therefore, some connection between the reward function and the latent context variable $m$ needs to be established. With MaxEnt IRL, a parametrized reward function induces a trajectory distribution. From the perspective of information theory, the mutual information between the context variable $m$ and the trajectories sampled from the reward-induced distribution provides an ideal measure of such a connection.

Formally, the mutual information between two random variables $m$ and $\tau$ under the joint distribution $p_\theta(m,\tau)=p(m)p_\theta(\tau|m)$ is given by:

$$I_{p_\theta}(m;\tau)=\mathbb{E}_{m\sim p(m),\,\tau\sim p_\theta(\tau|m)}\left[\log p_\theta(m|\tau)-\log p(m)\right] \tag{7}$$

where $p_\theta(\tau|m)$ is the conditional distribution (Equation (6)), and $p_\theta(m|\tau)$ is the corresponding posterior distribution.

As we do not have access to the prior distribution $p(m)$ and the posterior distribution $p_\theta(m|\tau)$, directly optimizing the mutual information in Equation (7) is intractable. Fortunately, we can leverage $q_\psi(m|\tau)$ as a variational approximation to $p_\theta(m|\tau)$ to reason about the uncertainty over tasks, as well as to conduct approximate sampling from $p(m)$ (we elaborate on this in Section 3.3). Formally, letting $p_{\pi_E}(\tau)$ denote the expert trajectory distribution, we have the following desiderata:

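Desideratum 1. Matching conditional distributions: $\mathbb{E}_{p(m)}\left[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m)\,\|\,p_\theta(\tau|m))\right]=0$

Desideratum 2. Matching posterior distributions: $\mathbb{E}_{p_\theta(\tau)}\left[D_{\mathrm{KL}}(p_\theta(m|\tau)\,\|\,q_\psi(m|\tau))\right]=0$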

The first desideratum encourages the $\theta$-induced conditional trajectory distribution to match the empirical distribution implicitly defined by the expert demonstrations, which is equivalent to the MLE objective in the MaxEnt IRL framework. Note that they also share the same marginal distribution over the context variable $p(m)$, which implies that matching the conditionals in Desideratum 1 will also encourage the joint distributions, the conditional distributions $p_{\pi_E}(m|\tau)$ and $p_\theta(m|\tau)$, and the marginal distributions over $\tau$ to be matched. The second desideratum encourages the variational posterior $q_\psi(m|\tau)$ to be a good approximation to $p_\theta(m|\tau)$, such that $q_\psi(m|\tau)$ can correctly infer the latent context variable given a new expert demonstration sampled from a new task.

With the mutual information (Equation (7)) being the objective, and Desiderata 1 and 2 being the constraints, the meta-inverse reinforcement learning with probabilistic context variables problem can be interpreted as a constrained optimization problem, whose Lagrangian dual function is given by:

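$$\min_{\theta,\psi}\ -I_{p_\theta}(m;\tau)+\alpha\cdot\mathbb{E}_{p(m)}\left[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m)\,\|\,p_\theta(\tau|m))\right]+\beta\cdot\mathbb{E}_{p_\theta(\tau)}\left[D_{\mathrm{KL}}(p_\theta(m|\tau)\,\|\,q_\psi(m|\tau))\right] \tag{8}$$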

With the Lagrange multipliers taking specific values ($\alpha=1,\beta=1$) (Zhao et al., 2018), the above Lagrangian dual function can be rewritten as:

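$$\max_{\theta,\psi}\ -\mathbb{E}_{p(m)}\left[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m)\,\|\,p_\theta(\tau|m))\right]+\mathbb{E}_{m\sim p(m),\,\tau\sim p_\theta(\tau|m)}\left[\log q_\psi(m|\tau)\right] \tag{9}$$
$$=\max_{\theta,\psi}\ -\mathbb{E}_{p(m)}\left[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m)\,\|\,p_\theta(\tau|m))\right]+\mathcal{L}_{\text{info}}(\theta,\psi) \tag{10}$$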

Here the negative entropy term $-H_p(m)=\mathbb{E}_{p_\theta(m,\tau)}[\log p(m)]=\mathbb{E}_{p(m)}[\log p(m)]$ is omitted (in Eq. (9)) as it can be treated as a constant in the optimization of the parameters $\theta$ and $\psi$.
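The step from Equation (8) to Equation (9) uses the identity $I_{p_\theta}(m;\tau)-\mathbb{E}_{p_\theta(\tau)}[D_{\mathrm{KL}}(p_\theta(m|\tau)\,\|\,q_\psi(m|\tau))]=\mathcal{L}_{\text{info}}(\theta,\psi)+H_p(m)$, where $\mathcal{L}_{\text{info}}(\theta,\psi)=\mathbb{E}_{m\sim p(m),\,\tau\sim p_\theta(\tau|m)}[\log q_\psi(m|\tau)]$, so with $\beta=1$ the intractable posterior cancels. The following sketch checks this numerically on a small discrete example; the sizes, the random tables standing in for $p(m)$, $p_\theta(\tau|m)$ and $q_\psi(m|\tau)$, and the use of NumPy are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def normalize(x, axis):
    """Scale non-negative entries so they sum to one along `axis`."""
    return x / x.sum(axis=axis, keepdims=True)


n_m, n_tau = 3, 4                                           # toy sizes (assumed)
p_m = normalize(rng.random(n_m), axis=0)                    # prior p(m)
p_tau_given_m = normalize(rng.random((n_m, n_tau)), axis=1)  # rows: p(tau|m)
q_m_given_tau = normalize(rng.random((n_m, n_tau)), axis=0)  # columns: q(m|tau)

joint = p_m[:, None] * p_tau_given_m     # p(m, tau)
p_tau = joint.sum(axis=0)                # marginal over tau
posterior = joint / p_tau[None, :]       # exact posterior p(m|tau)

# Mutual information I(m; tau), as in Equation (7).
mutual_info = np.sum(joint * (np.log(posterior) - np.log(p_m)[:, None]))
# E_{p(tau)} KL(p(m|tau) || q(m|tau)), the third term of Equation (8).
posterior_kl = np.sum(joint * (np.log(posterior) - np.log(q_m_given_tau)))
# L_info = E_{p(m, tau)} log q(m|tau), the second term of Equation (9).
info_term = np.sum(joint * np.log(q_m_given_tau))
# Entropy of the prior, the constant dropped in Equation (9).
entropy_m = -np.sum(p_m * np.log(p_m))

print(mutual_info - posterior_kl, info_term + entropy_m)
```

Because the KL term is non-negative, the same computation also shows that $\mathcal{L}_{\text{info}}+H_p(m)$ lower-bounds the mutual information, with equality when $q_\psi$ equals the true posterior.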

3.3 Achieving Tractability with Sampling-Based Gradient Estimation

Note that Equation (10) cannot be evaluated directly, as the first term requires estimating the KL divergence between the empirical expert distribution and the energy-based trajectory distribution $p_\theta(\tau|m)$ (induced by the $\theta$-parametrized reward function), and the second term requires sampling from it. To optimize the first term in Equation (10), as introduced in Section 2, we can employ the adversarial reward learning framework (Fu et al., 2017) to construct an efficient sampling-based approximation to the maximum likelihood objective. Note that, unlike in the original AIRL framework, the adaptive sampler $\pi_\omega(a|s,m)$ is now additionally conditioned on the context variable $m$. Furthermore, we introduce the following lemma, which will be helpful for deriving the optimization of the second term in Equation (10).

Lemma 1.

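In context variable augmented Adversarial IRL (with the adaptive sampler being $\pi_\omega(a|s,m)$ and the discriminator being $D_\theta(s,a,m)=\frac{\exp(f_\theta(s,a,m))}{\exp(f_\theta(s,a,m))+\pi_\omega(a|s,m)}$), under deterministic dynamics, when training the adaptive sampler $\pi_\omega$ with reward signal $(\log D_\theta-\log(1-D_\theta))$ to optimality, the trajectory distribution induced by $\pi_\omega^*$ corresponds to the maximum entropy trajectory distribution with $f_\theta(s,a,m)$ serving as the reward function:

$$p_{\pi_\omega^*}(\tau|m)=\frac{1}{Z(\theta)}\left[\eta(s_1)\prod_{t=1}^{T}P(s_{t+1}|s_t,a_t)\right]\exp\left(\sum_{t=1}^{T}f_\theta(s_t,a_t,m)\right)=p_\theta(\tau|m)$$

Proof. See Appendix A. ∎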

Now we are ready to introduce how to approximately optimize the second term of the objective in Equation (10) w.r.t. $\theta$ and $\psi$. First, we observe that the gradient of $\mathcal{L}_{\text{info}}(\theta,\psi)$ w.r.t. $\psi$ is given by:

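$$\frac{\partial}{\partial\psi}\mathcal{L}_{\text{info}}(\theta,\psi)=\mathbb{E}_{m\sim p(m),\,\tau\sim p_\theta(\tau|m)}\ \frac{1}{q(m|\tau,\psi)}\frac{\partial q(m|\tau,\psi)}{\partial\psi} \tag{11}$$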

Thus, to construct an estimate of the gradient in Equation (11), we need to obtain samples from the $\theta$-induced trajectory distribution $p_\theta(\tau|m)$. With Lemma 1, we know that when the adaptive sampler $\pi_\omega$ in AIRL is trained to optimality, we can use $\pi_\omega^*$ to construct samples, as the trajectory distribution $p_{\pi_\omega^*}(\tau|m)$ matches the desired distribution $p_\theta(\tau|m)$.
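The integrand in Equation (11) is the gradient of $\log q_\psi(m|\tau)$, so the $\psi$ update amounts to maximum-likelihood training of the inference model on sampled $(m,\tau)$ pairs. As a quick check of that reading, the sketch below assumes (purely for illustration) a one-dimensional Gaussian $q$ whose only parameter is its mean, and compares $\frac{1}{q}\frac{\partial q}{\partial\psi}$ computed by finite differences against the closed-form score $(m-\mu)/\sigma^2$.

```python
import math


def gaussian_pdf(m, mean, std):
    """Density of a 1-D Gaussian; stands in for q(m | tau, psi) with psi = mean."""
    z = (m - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def integrand_eq11(m, mean, std, eps=1e-6):
    """(1/q) * dq/dmean, with dq/dmean taken by a central finite difference."""
    dq = (gaussian_pdf(m, mean + eps, std)
          - gaussian_pdf(m, mean - eps, std)) / (2.0 * eps)
    return dq / gaussian_pdf(m, mean, std)


def score(m, mean, std):
    """Closed-form d log q / dmean for the same Gaussian."""
    return (m - mean) / std ** 2


m, mean, std = 0.3, -0.2, 0.8          # arbitrary illustrative values
numeric = integrand_eq11(m, mean, std)
closed_form = score(m, mean, std)
print(numeric, closed_form)
```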

Note that the expectation in Equation (11) is also taken over the prior task distribution $p(m)$. In cases where we have access to the ground-truth prior distribution, we can directly sample $m$ from it and use $p_{\pi_\omega^*}(\tau|m)$ to construct a gradient estimate. For the most general case, where we do not have access to $p(m)$ but instead have expert demonstrations sampled from $p_{\pi_E}(\tau)$, we use the following generative process:

$$\tau\sim p_{\pi_E}(\tau),\quad m\sim q_\psi(m|\tau) \tag{12}$$

to synthesize latent context variables, which approximates the prior task distribution when $\theta$ and $\psi$ are trained to optimality.

To optimize $\mathcal{L}_{\text{info}}(\theta,\psi)$ w.r.t. $\theta$, an important step that updates the reward function parameters so that the reward encodes information about the latent context variable, we cannot directly replace $p_\theta(\tau|m)$ with $p_{\pi_\omega}(\tau|m)$ as we did for Equation (11). The reason is that the approximation of $p_\theta$ can only be used for inference (i.e. computing the value of an expectation). When we want to optimize an expectation ($\mathcal{L}_{\text{info}}(\theta,\psi)$) w.r.t. $\theta$ and the expectation is taken over $p_\theta$ itself, we cannot simply substitute $\pi_\omega$ for $p_\theta$ when sampling, because the dependence of the sampling distribution on $\theta$ would be lost. In the following, we discuss how to estimate the gradient of $\mathcal{L}_{\text{info}}(\theta,\psi)$ w.r.t. $\theta$ with empirical samples from $\pi_\omega$.

Lemma 2.

The gradient of $\mathcal{L}_{\text{info}}(\theta,\psi)$ w.r.t. $\theta$ can be estimated with:


When ω is trained to optimality, the estimation is unbiased.


Proof. See Appendix B. ∎
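An identity of the following kind is what makes a sample-based estimate of this gradient possible. For any energy-based distribution $p_\theta(x)\propto\exp(f_\theta(x))$ and any function $g$ that does not depend on $\theta$, $\nabla_\theta\mathbb{E}_{p_\theta}[g(x)]=\mathbb{E}_{p_\theta}\left[g(x)\left(\nabla_\theta f_\theta(x)-\mathbb{E}_{x'\sim p_\theta}[\nabla_\theta f_\theta(x')]\right)\right]$. The sketch below checks it on a small discrete space against finite differences; here $x$ plays the role of a trajectory, $g$ stands in for $\log q_\psi(m|\tau)$, and the tabular $f_\theta$ and all numbers are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5                               # number of "trajectories" in the toy space
theta = rng.normal(size=n)          # tabular energy: f_theta(x) = theta[x]
g = rng.normal(size=n)              # stand-in for log q_psi(m | tau)


def distribution(theta):
    """p_theta(x) proportional to exp(f_theta(x)), computed stably."""
    weights = np.exp(theta - theta.max())
    return weights / weights.sum()


def objective(theta):
    """E_{p_theta}[g(x)]."""
    return float(distribution(theta) @ g)


def score_function_gradient(theta):
    """E_p[ g(x) * (grad f(x) - E_p[grad f]) ] for the tabular energy."""
    p = distribution(theta)
    grad_f = np.eye(n)              # row x holds d f_theta(x) / d theta
    baseline = p @ grad_f           # E_p[grad f]
    return (p * g) @ (grad_f - baseline)


def finite_difference_gradient(theta, eps=1e-6):
    """Central finite differences of the objective, one coordinate at a time."""
    grad = np.zeros(n)
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        grad[i] = (objective(theta + step) - objective(theta - step)) / (2 * eps)
    return grad


print(score_function_gradient(theta))
print(finite_difference_gradient(theta))
```

In practice both expectations would be replaced by averages over trajectories sampled from the adaptive sampler, which is where the requirement that $\omega$ be trained to optimality enters.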

With Lemma 2, as before, we can use the generative process in Equation (12) to sample $m$ and use the conditional trajectory distribution $p_{\pi_\omega^*}(\tau|m)$ to sample trajectories for estimating $\frac{\partial}{\partial\theta}\mathcal{L}_{\text{info}}(\theta,\psi)$. The overall training objective of PEMIRL is:

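$$\mathbb{E}_{\tau_E\sim p_{\pi_E}(\tau),\,m\sim q_\psi(m|\tau_E)}\log\left(D_\theta(s,a,m)\right)+\mathcal{L}_{\text{info}}(\theta,\psi) \tag{13}$$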

We summarize the meta-training procedure in Algorithm 1 and the meta-test procedure in Appendix C.

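Algorithm 1 PEMIRL Meta-Training
  Input: Expert trajectories $\mathcal{D}_E=\{\tau_E^j\}$; initial parameters of $f_\theta,\pi_\omega,q_\psi$.
  repeat
     Sample two batches of unlabeled demonstrations: $\tau_E,\tau'_E\sim\mathcal{D}_E$
     Infer a batch of latent context variables from the sampled demonstrations: $m\sim q_\psi(m|\tau_E)$
     Sample trajectories $\mathcal{D}$ from $\pi_\omega(\tau|m)$, with the latent context variable fixed during each rollout and included in $\mathcal{D}$.
     Update $\psi$ to increase $\mathcal{L}_{\text{info}}(\theta,\psi)$ with gradients in Equation (11), with samples from $\mathcal{D}$.
     Update $\theta$ to increase $\mathcal{L}_{\text{info}}(\theta,\psi)$ with the gradient estimate from Lemma 2, with samples from $\mathcal{D}$.
     Update $\theta$ to decrease the binary classification loss of the discriminator $D_\theta$ between expert demonstrations and the samples in $\mathcal{D}$.
     Update $\omega$ with TRPO to increase the following objective: $\mathbb{E}_{(s,a,m)\sim\mathcal{D}}[\log D_\theta(s,a,m)]$
  until Convergence
  Output: Learned inference model $q_\psi(m|\tau)$, reward function $f_\theta(s,a,m)$ and policy $\pi_\omega(a|s,m)$.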
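The control flow of Algorithm 1 can be sketched as follows. Every learner is replaced by a placeholder callable (all function and parameter names here are hypothetical), so the code shows only the ordering of the steps and what data each step receives; which expert batch feeds the discriminator update, and the fixed iteration count standing in for a convergence test, are assumptions of this sketch.

```python
import random


def meta_train(demos, infer_context, rollout, update_psi, update_theta_info,
               update_theta_disc, update_policy, num_iters, batch_size, seed=0):
    """Schematic PEMIRL-style loop; all callables are caller-supplied stubs."""
    rng = random.Random(seed)
    for _ in range(num_iters):               # stands in for "until convergence"
        # Two batches of unlabeled demonstrations.
        expert_a = rng.sample(demos, batch_size)
        expert_b = rng.sample(demos, batch_size)
        # One inferred latent context per demonstration in the first batch.
        contexts = [infer_context(tau) for tau in expert_a]
        # Roll out the policy with each context held fixed; keep (tau, m) pairs.
        samples = [(rollout(m), m) for m in contexts]
        update_psi(samples)                   # raise L_info w.r.t. psi
        update_theta_info(samples)            # raise L_info w.r.t. theta
        update_theta_disc(expert_b, samples)  # discriminator classification step
        update_policy(samples)                # policy step (TRPO in the paper)


# Usage with trivial stubs that only record the order of the updates.
calls = []
meta_train(
    demos=[[0.1], [0.5], [0.9], [1.3]],
    infer_context=lambda tau: sum(tau) / len(tau),
    rollout=lambda m: [m, m],
    update_psi=lambda samples: calls.append("psi"),
    update_theta_info=lambda samples: calls.append("theta_info"),
    update_theta_disc=lambda expert, samples: calls.append("theta_disc"),
    update_policy=lambda samples: calls.append("policy"),
    num_iters=2,
    batch_size=2,
)
print(calls)
```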

4 Related Work

Inverse reinforcement learning (IRL), first introduced by Ng and Russell (2000), is the problem of learning reward functions directly from expert demonstrations. Prior work tackling IRL includes margin-based methods Abbeel and Ng (2004); Ratliff et al. (2006) and maximum entropy (MaxEnt) methods Ziebart et al. (2008). Margin-based methods suffer from the problem being underdefined, while MaxEnt requires the algorithm to solve the forward RL problem in the inner loop, making it challenging to use in non-tabular settings. Recent works have scaled MaxEnt IRL to large function approximators, such as neural networks, by only partially solving the forward problem in the inner loop, developing an adversarial framework for IRL Finn et al. (2016, 2016); Fu et al. (2017); Peng et al. (2018). Other imitation learning approaches Ho and Ermon (2016); Li et al. (2017); Hausman et al. (2017); Kuefler and Kochenderfer (2018) are also based on the adversarial framework, but they do not recover a reward function. We build upon the ideas in these single-task IRL works. Instead of considering the problem of learning reward functions for a single task, we aim at the problem of inferring a reward that is disentangled from the environment dynamics and can quickly adapt to new tasks from a single demonstration by leveraging prior data.

We base our work on the problem of meta-learning. Prior work has proposed memory-based methods Duan et al. (2016); Santoro et al. (2016); Munkhdalai and Yu (2017); Rakelly et al. (2019) and methods that learn an optimizer and/or a parameter initialization Andrychowicz et al. (2016); Li and Malik (2017); Finn et al. (2017a). We adopt a memory-based meta-learning method similar to Rakelly et al. (2019), which uses a deep latent variable generative model Kingma and Welling (2013) to infer different tasks from demonstrations. While prior multi-task and meta-RL methods Hausman et al. (2018); Rakelly et al. (2019); Sæmundsson et al. (2018) have investigated the effectiveness of applying latent variable generative models to learning task embeddings, we focus on the IRL problem instead. Meta-IRL Xu et al. (2018); Gleave and Habryka (2018) incorporates meta-learning and IRL, showing fast adaptation of the reward functions to unseen tasks. Unlike these approaches, our method is not restricted to discrete tabular settings and does not require access to grouped demonstrations sampled from a task distribution. Meanwhile, one-shot imitation learning Duan et al. (2017); Finn et al. (2017b); Yu et al. (2018b, a) demonstrates impressive results on learning new tasks using a single demonstration; yet, these methods also require paired demonstrations from each task and hence need prior knowledge of the task distribution. More importantly, one-shot imitation learning approaches only recover a policy, and cannot use additional trials to continue to improve, which is possible when a reward function is inferred instead. Several prior approaches for multi-task imitation learning Li et al. (2017); Hausman et al. (2017); Sharma et al. (2018) propose to use unstructured demonstrations without knowing the task distribution, but they neither study quick generalization to new tasks nor provide a reward function.
Our work is thus driven by the goal of extending meta-IRL to addressing challenging high-dimensional control tasks with the help of an unstructured demonstration dataset.

5 Experiments

In this section, we seek to investigate the following two questions: (1) Can PEMIRL learn a policy with competitive few-shot generalization abilities compared to one-shot imitation learning methods using only unstructured demonstrations? (2) Can PEMIRL efficiently infer robust reward functions of new continuous control tasks where one-shot imitation learning fails to generalize, enabling an agent to continue to improve with more trials?

We evaluate our method on four simulated domains using the MuJoCo physics engine Todorov et al. (2012). To our knowledge, there is no prior work on designing meta-IRL or one-shot imitation learning methods for complex domains with high-dimensional continuous state-action spaces with unstructured demonstrations. Hence, we also designed the following variants of existing state-of-the-art (one-shot) imitation learning and IRL methods so that they can be used as fair comparisons to our method:

  • AIRL: The original AIRL algorithm without incorporating latent context variables, trained across all demonstrations.

  • Meta-Imitation Learning with Latent Context Variables (Meta-IL): As in Rakelly et al. (2019), we use the inference model $q_\psi(m|\tau)$ to infer the context of a new task from a single demonstrated trajectory, denoted as $\hat{m}$, and then train the conditional imitation policy $\pi_\omega(a|s,\hat{m})$ using the same demonstration. This approach also resembles Duan et al. (2017).

  • Meta-InfoGAIL: Similar to the method above, except that an additional discriminator $D(s,a)$ is introduced to distinguish between expert and sample trajectories, and is trained along with the conditional policy using the InfoGAIL objective (Li et al., 2017).

We use trust region policy optimization (TRPO) Schulman et al. (2015) as our policy optimization algorithm across all methods. We collect demonstrations by training experts with TRPO using the ground-truth reward. However, the ground-truth reward is not available to the imitation learning and IRL algorithms. We provide full hyperparameters, architecture information, data efficiency, and experimental setup details in Appendix F. We also include ablation studies on sensitivity to the latent dimensions, the importance of the mutual information objective, and performance in stochastic environments in Appendix E. Full video results are on the anonymous supplementary website and our code is open-sourced on GitHub.

Figure 1: Experimental domains (left to right): Point-Maze, Ant, Sweeper, and Sawyer Pusher.

5.1 Policy Performance on Test Tasks

| Method | Point Maze | Ant | Sweeper | Sawyer Pusher |
|---|---|---|---|---|
| Expert | -5.21±0.93 | 968.80±27.11 | -50.86±4.75 | -23.36±2.54 |
| Random | -51.39±10.31 | -55.65±18.39 | -259.71±11.24 | -106.88±18.06 |
| AIRL Fu et al. (2017) | -18.15±3.17 | 127.61±27.34 | -152.78±7.39 | -51.56±8.57 |
| Meta-IL | -6.68±1.51 | 218.53±26.48 | -89.02±7.06 | -28.13±4.93 |
| Meta-InfoGAIL | -7.66±1.85 | 871.93±31.28 | -87.06±6.57 | -27.56±4.86 |
| PEMIRL (ours) | -7.37±1.02 | 846.18±25.81 | -74.17±5.31 | -27.16±3.11 |

Table 1: One-shot policy generalization to test tasks on four experimental domains. Average return and standard deviations are reported over 5 runs.

We answer the first question by showing that our method is able to learn a policy that can adapt to test tasks from a single demonstration, on four continuous control domains:

  • Point Maze Navigation: In this domain, a pointmass needs to navigate around a barrier to reach the goal. Different tasks correspond to different goal positions and the reward function measures the distance between the pointmass and the goal position.

  • Ant: Similar to Finn et al. (2017a), this locomotion task requires fast adaptation to the walking direction of the ant, where the ant needs to learn to move backward or forward depending on the demonstration.

  • Sweeper: A robot arm needs to sweep an object to a particular goal position. Fast adaptation in this domain corresponds to different goal locations in the plane.

  • Sawyer Pusher: A simulated Sawyer robot is required to push a mug to a variety of goal positions and generalize to unseen goals.

We illustrate the set-up for these experimental domains in Figure 1.

We summarize the results in Table 1. PEMIRL achieves imitation performance comparable to Meta-IL and Meta-InfoGAIL, while AIRL is incapable of handling multi-task scenarios without incorporating the latent context variables.
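One way to compare the entries of Table 1 across domains with different return scales is to rescale each return so that the random policy maps to 0 and the expert to 1. This normalization is an illustrative convention, not a metric reported in the paper; the numbers are the mean returns from Table 1 (standard deviations are ignored).

```python
# Mean returns copied from Table 1.
TABLE_1 = {
    "Point Maze": {"Expert": -5.21, "Random": -51.39, "AIRL": -18.15,
                   "Meta-IL": -6.68, "Meta-InfoGAIL": -7.66, "PEMIRL": -7.37},
    "Ant": {"Expert": 968.80, "Random": -55.65, "AIRL": 127.61,
            "Meta-IL": 218.53, "Meta-InfoGAIL": 871.93, "PEMIRL": 846.18},
    "Sweeper": {"Expert": -50.86, "Random": -259.71, "AIRL": -152.78,
                "Meta-IL": -89.02, "Meta-InfoGAIL": -87.06, "PEMIRL": -74.17},
    "Sawyer Pusher": {"Expert": -23.36, "Random": -106.88, "AIRL": -51.56,
                      "Meta-IL": -28.13, "Meta-InfoGAIL": -27.56, "PEMIRL": -27.16},
}


def normalized_return(domain, method):
    """Rescale a mean return so that Random maps to 0 and Expert maps to 1."""
    row = TABLE_1[domain]
    return (row[method] - row["Random"]) / (row["Expert"] - row["Random"])


for domain in TABLE_1:
    scores = {method: round(normalized_return(domain, method), 3)
              for method in ("AIRL", "Meta-IL", "Meta-InfoGAIL", "PEMIRL")}
    print(domain, scores)
```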

5.2 Reward Adaptation to Challenging Situations

Figure 2: Visualizations of learned reward functions for point-maze navigation. The red star represents the target position and the white circle represents the initial position of the agent (both are different across different iterations). The black horizontal line represents the barrier that cannot be crossed. To show the generalization ability, the expert demonstrations used to infer the target position are sampled from new target positions that have not been seen in the meta-training set.
Figure 3: From top to bottom, we show the disabled ant running forward and backward respectively.

After demonstrating that the policy learned by our method is able to achieve competitive “one-shot” generalization ability, we now answer the second question by showing PEMIRL learns a robust reward that can adapt to new and more challenging settings where the imitation learning methods and the original AIRL fail. Specifically, after providing the demonstration of an unseen task to the agent, we change the underlying environment dynamics but keep the same task goal. In order to succeed in the task with new dynamics, the agent must correctly infer the underlying goal of the task instead of simply mimicking the demonstration. We show the effectiveness of our reward generalization by training a new policy with TRPO using the learned reward functions on the new task.

Point-Maze Navigation with a Shifted Barrier. Following the setup of Fu et al. (2017), at meta-test time, after showing a demonstration moving towards a new target position, we change the position of the barrier from left to right. As the agent must adapt by reaching the target with a different path from what was demonstrated during meta-training, it cannot succeed without correctly inferring the true goal (the target position in the maze) and learning from trial and error. As a result, all direct policy generalization approaches fail, as all the policies still direct the pointmass to the right side of the maze. As shown in Figure 2, PEMIRL learns disentangled reward functions that successfully infer the underlying goal of the new task without much reward shaping. Such reward functions enable the RL agent to bypass the right barrier and reach the true goal position. The RL agent trained with the reward learned by AIRL also fails to bypass the barrier and navigate to the target position because, without incorporating the latent context variables and treating the demonstrations as multi-modal, AIRL learns an “average” reward and policy among different tasks. We also use the output of the discriminator of Meta-InfoGAIL as the reward signal and evaluate its adaptation performance. The agent trained with this reward fails to complete the task, since Meta-InfoGAIL does not explicitly optimize for reward learning and the discriminator output converges to an uninformative uniform distribution.

Method                   Point-Maze-Shift   Disabled-Ant
Policy Generalization
  Meta-IL                -28.61±3.71        -27.86±10.31
  Meta-InfoGAIL          -29.40±3.05        -51.08±4.81
  PEMIRL                 -28.93±3.59        -46.77±5.54
Reward Adaptation
  AIRL                   -29.07±4.12        -76.21±10.35
  Meta-InfoGAIL          -29.72±3.11        -38.73±6.41
  PEMIRL (ours)          -9.04±1.09         152.62±11.75
Expert                   -5.37±0.86         331.17±17.82
Table 2: Results on direct policy generalization and reward adaptation to challenging situations. Policy generalization examines whether the policy learned by Meta-IL is able to generalize to new tasks with new dynamics, while reward adaptation tests whether the learned reward can lead to efficient RL training in the same setting. The RL agent trained with PEMIRL rewards outperforms the other methods in these challenging settings.

Disabled Ant Walking. As in Fu et al. (2017), we disable and shorten the two front legs of the ant such that it cannot walk without changing its gait to a large extent. As with Point-Maze-Shift, all imitation policies fail to maneuver the disabled ant in the right direction. As shown in Figure 3, reward functions learned by PEMIRL encourage the RL policy to orient the ant towards the demonstrated direction and move along that direction using the two healthy legs, which is only possible when the inferred reward corresponds to the true underlying goal and is disentangled from the dynamics. In contrast, neither the learned reward of the original AIRL nor the discriminator output of Meta-InfoGAIL can infer the underlying goal of the task and provide a precise supervision signal, which leads to the unsatisfactory performance of the induced RL policies. Quantitative results are presented in Table 2.

6 Conclusion

In this paper, we propose a new meta-inverse reinforcement learning algorithm, PEMIRL, which is able to efficiently infer robust reward functions that are disentangled from the dynamics and highly correlated with the ground-truth rewards under meta-learning settings. To our knowledge, PEMIRL is the first model-free Meta-IRL algorithm that can achieve this and scale to complex domains with continuous state-action spaces. PEMIRL generalizes to new tasks by performing inference over a latent context variable with a single demonstration, on which the recovered policy and reward function are conditioned. Extensive experimental results demonstrate the scalability and effectiveness of our method against strong baselines.


This research was supported by Toyota Research Institute, NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024). The authors would like to thank Chris Cundy for discussions over the paper draft.


  • P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, ACM, pp. 1. Cited by: §1, §4.
  • D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in AI safety. arXiv preprint arXiv:1606.06565. Cited by: §1.
  • M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas (2016) Learning to learn by gradient descent by gradient descent. arXiv preprint arXiv:1606.04474. Cited by: §4.
  • Y. Bengio, S. Bengio, and J. Cloutier (1991) Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, Cited by: §1.
  • Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba (2017) One-shot imitation learning. Neural Information Processing Systems (NIPS). Cited by: §4, 2nd item.
  • Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel (2016) RL$^2$: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779. Cited by: §1, §3.1, §4.
  • C. Finn, P. Abbeel, and S. Levine (2017a) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, Cited by: §1, §3.1, §4, §5.1.
  • C. Finn, P. Christiano, P. Abbeel, and S. Levine (2016) A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852. Cited by: §2, §4.
  • C. Finn, S. Levine, and P. Abbeel (2016) Guided cost learning: deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58. Cited by: §4.
  • C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine (2017b) One-shot visual imitation learning via meta-learning. Cited by: §4.
  • J. Fu, K. Luo, and S. Levine (2017) Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248. Cited by: §F.1, §F.3, §1, §2, §3.3, §4, §5.2, §5.2, Table 1.
  • A. Gleave and O. Habryka (2018) Multi-task maximum entropy inverse reinforcement learning. arXiv preprint arXiv:1805.08882. Cited by: §1, §4.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • K. Hausman, Y. Chebotar, S. Schaal, G. Sukhatme, and J. J. Lim (2017) Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 1235–1245. Cited by: §4, §4.
  • K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller (2018) Learning an embedding space for transferable robot skills. Cited by: §4.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, pp. 4565–4573. Cited by: §1, §4.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §4.
  • A. Kuefler and M. J. Kochenderfer (2018) Burn-in demonstrations for multi-modal imitation learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’18. Cited by: §4.
  • S. Levine (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909. Cited by: Appendix A.
  • K. Li and J. Malik (2017) Learning to optimize neural nets. arXiv preprint arXiv:1703.00441. Cited by: §4.
  • Y. Li, J. Song, and S. Ermon (2017) Infogail: interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3812–3822. Cited by: §3.2, §4, §4, 3rd item.
  • T. Munkhdalai and H. Yu (2017) Meta networks. International Conference on Machine Learning (ICML). Cited by: §4.
  • A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287. Cited by: §3.1.
  • A. Y. Ng and S. J. Russell (2000) Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00. Cited by: §1, §4.
  • X. B. Peng, A. Kanazawa, S. Toyer, P. Abbeel, and S. Levine (2018) Variational discriminator bottleneck: improving imitation learning, inverse rl, and gans by constraining information flow. arXiv preprint arXiv:1810.00821. Cited by: §4.
  • K. Rakelly, A. Zhou, D. Quillen, C. Finn, and S. Levine (2019) Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254. Cited by: §1, §4, 2nd item.
  • N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich (2006) Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06. Cited by: §4.
  • S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §3.1.
  • S. Sæmundsson, K. Hofmann, and M. P. Deisenroth (2018) Meta reinforcement learning with latent variable gaussian processes. arXiv preprint arXiv:1803.07551. Cited by: §4.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), Cited by: §4.
  • S. Schaal, A. Ijspeert, and A. Billard (2003) Computational approaches to motor learning by imitation. Philosophical Transactions of the Royal Society of London B: Biological Sciences. Cited by: §1.
  • J. Schmidhuber (1987) Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. Ph.D. Thesis, Technische Universität München. Cited by: §1.
  • J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015) Trust region policy optimization. International Conference on Machine Learning. Cited by: §5.
  • A. Sharma, M. Sharma, N. Rhinehart, and K. M. Kitani (2018) Directed-info GAIL: learning hierarchical policies from unsegmented demonstrations using directed information. arXiv preprint arXiv:1810.01266. Cited by: §4.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §5.
  • K. Xu, E. Ratner, A. Dragan, S. Levine, and C. Finn (2018) Learning a prior over intent via meta-inverse reinforcement learning. arXiv preprint arXiv:1805.12573. Cited by: §1, §4.
  • T. Yu, P. Abbeel, S. Levine, and C. Finn (2018a) One-shot hierarchical imitation learning of compound visuomotor tasks. arXiv preprint arXiv:1810.11043. Cited by: §4.
  • T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine (2018b) One-shot imitation from observing humans via domain-adaptive meta-learning. Robotics: Science and Systems (R:SS). Cited by: §1, §4.
  • T. Zhang, Z. McCarthy, O. Jow, D. Lee, K. Goldberg, and P. Abbeel (2017) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. arXiv preprint arXiv:1710.04615. Cited by: §1.
  • S. Zhao, J. Song, and S. Ermon (2018) The information autoencoding family: a lagrangian perspective on latent variable generative models. arXiv preprint arXiv:1806.06514. Cited by: §3.2.
  • B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438. Cited by: §1, §2, §4.
  • B. D. Ziebart (2010) Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Cited by: §1.

Appendix A Proof of Lemma 1

From Section 2.3 and 2.4 in [Levine, 2018], we know that the policy whose induced trajectory distribution is Equation (6) takes the following energy-based form:

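$$
\begin{aligned}
\pi_\theta(a_t \mid s_t, m) &= \exp\big(Q_{\text{soft}}(s_t, a_t, m) - V_{\text{soft}}(s_t, m)\big) \\
Q_{\text{soft}}(s_t, a_t, m) &= f_\theta(s_t, a_t, m) + \log \mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t, m)}\big[\exp\big(V_{\text{soft}}(s_{t+1}, m)\big)\big] \\
V_{\text{soft}}(s_t, m) &= \log \int_{\mathcal{A}} \exp\big(Q_{\text{soft}}(s_t, a, m)\big)\, da
\end{aligned}
$$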

which corresponds to the optimal policy to the following entropy regularized reinforcement learning problem (for a certain value of m):

$$
\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=1}^{T} f_\theta(s_t, a_t, m) - \log \pi(a_t \mid s_t, m)\right] \tag{14}
$$

From Section 2, we know that Equation (14) is exactly the training objective for the adaptive sampler πω in AIRL. Thus, the trajectory distribution of the optimal policy πω* matches pθ(τ|m) defined in Equation (6).
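As a small numerical illustration of the energy-based form above, the following sketch replaces the continuous action integral with a hypothetical set of three discrete actions and uses made-up soft Q-values. It only checks that exp(Q_soft − V_soft) is a properly normalized distribution when V_soft is the log-sum-exp of Q_soft; it is not the authors' implementation.

```python
import math

def soft_value(q_values):
    """V_soft(s) = log sum_a exp(Q_soft(s, a)), computed in a numerically stable way."""
    q_max = max(q_values)
    return q_max + math.log(sum(math.exp(q - q_max) for q in q_values))

def soft_policy(q_values):
    """pi(a|s) = exp(Q_soft(s, a) - V_soft(s)) for every action a."""
    v = soft_value(q_values)
    return [math.exp(q - v) for q in q_values]

# Hypothetical soft Q-values for three discrete actions in a single state.
q = [1.0, 2.0, 0.5]
pi = soft_policy(q)
total = sum(pi)  # equals 1 up to floating-point error
```

Actions with a larger soft Q-value receive exponentially more probability mass, which is the sense in which the policy is "energy-based".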

Appendix B Proof of Lemma 2

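First, the gradient of $\mathcal{L}_{\text{info}}(\theta,\psi)$ w.r.t. $\theta$ can be written as:

$$
\frac{\partial}{\partial\theta} \mathcal{L}_{\text{info}}(\theta,\psi) = \mathbb{E}_{m \sim p(m),\, \tau \sim p(\tau \mid m, \theta)}\left[\log q(m \mid \tau, \psi)\, \frac{\partial}{\partial\theta} \log p_\theta(\tau \mid m)\right] \tag{15}
$$

As $p_\theta(\tau \mid m)$ is an energy-based distribution (Equation (6)), we need to derive the gradient of $\log p(\tau \mid m, \theta)$ w.r.t. $\theta$:

$$
\begin{aligned}
\frac{\partial}{\partial\theta} \log p(\tau \mid m, \theta) &= \frac{\partial}{\partial\theta}\left[\log\left(\eta(s_1) \prod_{t=1}^{T} P(s_{t+1} \mid s_t, a_t)\right) + \sum_{t=1}^{T} f_\theta(s_t, a_t, m) - \log Z(\theta)\right] && (16) \\
&= \sum_{t=1}^{T} \frac{\partial}{\partial\theta} f_\theta(s_t, a_t, m) - \frac{\partial}{\partial\theta} \log Z(\theta) && (17) \\
&= \sum_{t=1}^{T} \frac{\partial}{\partial\theta} f_\theta(s_t, a_t, m) - \mathbb{E}_{\tau \sim p(\tau \mid m, \theta)}\left[\sum_{t=1}^{T} \frac{\partial}{\partial\theta} f_\theta(s_t, a_t, m)\right] && (18)
\end{aligned}
$$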

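Substituting Equation (18) into Equation (15), we get:

$$
\frac{\partial}{\partial\theta} \mathcal{L}_{\text{info}}(\theta,\psi) = \mathbb{E}_{m \sim p(m),\, \tau \sim p(\tau \mid m, \theta)}\left[\log q(m \mid \tau, \psi)\left(\sum_{t=1}^{T} \frac{\partial}{\partial\theta} f_\theta(s_t, a_t, m) - \mathbb{E}_{\tau' \sim p(\tau \mid m, \theta)}\left[\sum_{t=1}^{T} \frac{\partial}{\partial\theta} f_\theta(s'_t, a'_t, m)\right]\right)\right]
$$

With Lemma 1, we know that when $\omega$ is trained to optimality, we can sample from $p_{\pi_\omega^*}(\tau \mid m)$ to construct an unbiased gradient estimate.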
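The identity used in this proof can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's code: it fixes one context m, enumerates four hypothetical "trajectories", takes f to be linear in a scalar θ (so the per-trajectory derivative of the summed reward is a fixed number), and compares the analytic gradient from the substituted expression with a finite-difference gradient of the objective. All numeric values are made up.

```python
import math

# Hypothetical toy problem: four enumerable trajectories for one fixed context m.
base = [0.4, 0.3, 0.2, 0.1]        # stands in for eta(s_1) * prod_t P(s_{t+1}|s_t, a_t)
feat = [1.0, -0.5, 2.0, 0.0]       # stands in for d/dtheta sum_t f_theta(s_t, a_t, m)
log_q = [-0.2, -1.5, -0.7, -2.0]   # stands in for log q(m | tau, psi)

def probs(theta):
    """Energy-based trajectory distribution p(tau | m, theta), normalized by Z(theta)."""
    weights = [b * math.exp(theta * x) for b, x in zip(base, feat)]
    z = sum(weights)
    return [w / z for w in weights]

def l_info(theta):
    """E_{tau ~ p(tau|m,theta)}[log q(m|tau,psi)] for the fixed context."""
    return sum(p * lq for p, lq in zip(probs(theta), log_q))

def grad_l_info(theta):
    """Analytic gradient: E[log q * (feature - E[feature])]."""
    p = probs(theta)
    mean_feat = sum(pi * x for pi, x in zip(p, feat))
    return sum(pi * lq * (x - mean_feat) for pi, lq, x in zip(p, log_q, feat))

theta, eps = 0.3, 1e-6
numeric = (l_info(theta + eps) - l_info(theta - eps)) / (2 * eps)  # central difference
analytic = grad_l_info(theta)
```

In this toy setting the expectation is computed exactly by enumeration; in the paper's setting it is estimated from samples of the trained adaptive sampler, as stated above.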

Appendix C Meta-Testing Procedure of PEMIRL

We summarize the meta-test stage of PEMIRL for adapting reward functions to new tasks in Algorithm 2.

Algorithm 2 PEMIRL Meta-Test for Reward Adaptation
  Input: A test context variable $m \sim p(m)$, a test expert demonstration $\tau_E \sim p_{\pi_E}(\tau \mid m)$, and ground-truth reward $r(s,a,m)$.
  Infer the latent context variable from the test demonstration: $\hat{m} \sim q_\psi(m \mid \tau_E)$.
  Train a policy using TRPO w.r.t. the adapted reward function $f_\theta(s,a,\hat{m})$.
  Evaluate the learned policy with $r(s,a,m)$.
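
The control flow of Algorithm 2 can be sketched as follows. This is a minimal illustration with hypothetical stand-ins: the inference network is assumed to be a diagonal Gaussian with a given mean function, the "learned reward" is a hand-written negative distance, and TRPO is replaced by a trivial search over candidate target states so that the example runs with the standard library alone. None of these stand-ins come from the paper.

```python
import random

def infer_context(demo, q_mean, q_std, rng):
    """Step 1: sample m_hat ~ q_psi(m | tau_E), assuming a diagonal Gaussian q_psi."""
    return [rng.gauss(mu_i, q_std) for mu_i in q_mean(demo)]

def make_adapted_reward(f_theta, m_hat):
    """Freeze the inferred context inside the learned reward f_theta(s, a, m_hat)."""
    return lambda s, a: f_theta(s, a, m_hat)

def meta_test(demo, q_mean, q_std, f_theta, train_policy, evaluate, rng):
    m_hat = infer_context(demo, q_mean, q_std, rng)
    reward_fn = make_adapted_reward(f_theta, m_hat)
    policy = train_policy(reward_fn)   # Step 2: TRPO in the paper, any RL routine here
    return evaluate(policy)            # Step 3: scored with the ground-truth reward

# Toy stand-ins (all hypothetical): 1-D states, context = goal position.
demo = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.0)]   # (state, action) pairs ending at the goal
q_mean = lambda tau: [tau[-1][0]]             # "infers" the goal as the final state
f_theta = lambda s, a, m: -abs(s - m[0])      # stand-in for the learned reward
# Stand-in for RL training: choose the candidate state with the highest reward.
train_policy = lambda r: max([0.0, 1.0, 2.0, 3.0], key=lambda s: r(s, None))
evaluate = lambda target: -abs(target - 2.0)  # ground-truth reward for a goal at 2.0

score = meta_test(demo, q_mean, 0.0, f_theta, train_policy, evaluate, random.Random(0))
```

The point of the structure is that the ground-truth reward is used only in the final evaluation; training sees only the learned reward conditioned on the inferred context.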

Appendix D Graphical Model of PEMIRL

Figure 4: Graphical model underlying PEMIRL.

Here we show the graphical model of the PEMIRL framework in Figure 4.

Appendix E Ablation Studies

In this section, we perform ablation studies on the sensitivity to the latent dimension, the importance of the mutual information loss term ($\mathcal{L}_{\text{info}}$), and the stochasticity of the environment. We conduct each ablation study on the Point-Maze-Shift environment to evaluate the reward adaptation performance.

latent dim.      return
1                -10.58±1.27
3                -14.13±1.21
5                -15.41±1.40
Table 3: PEMIRL is robust to latent dimensions.

method           return
PEMIRL w/o MI    -39.24±3.48
PEMIRL           -14.13±1.21
Table 4: The MI term is important for training PEMIRL.

method           return
Meta-IL          -30.58±4.17
PEMIRL           -17.39±0.84
Table 5: PEMIRL excels in stochastic env.

Sensitivity of the latent dimension. We first investigate the sensitivity to the latent dimension by running PEMIRL with the latent dimension picked from {1,3,5} on Point-Maze-Shift, where the ground-truth latent dimension is 3. The results are summarized in Table 3. We observe that PEMIRL stably outperforms the best baseline (return -28.61) under every latent dimension specification and is hence robust to dimension mis-specification.

Importance of $\mathcal{L}_{\text{info}}$. As shown in Table 4, the reward function learned by PEMIRL without the mutual information objective fails to induce a good policy in the reward adaptation setting, which demonstrates the importance of using $\mathcal{L}_{\text{info}}$.

Performance on stochastic environment. We create a stochastic version of Point-Maze-Shift (maze size: 60×100 cm) by replacing its deterministic transition dynamics with stochastic ones. Specifically, $p(s_{t+1} \mid s_t, a_t)$ is now realized as a Gaussian with a standard deviation of 1 cm. As shown in Table 5, PEMIRL outperforms the best baseline, Meta-IL, in average return by a large margin.
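
One way to picture the stochastic variant is as a deterministic update followed by additive Gaussian noise. The sketch below assumes the Gaussian is centered on the deterministic next state and uses a simple additive point-mass update as a hypothetical stand-in for the simulator; only the 1 cm standard deviation comes from the text.

```python
import random

def stochastic_step(state, action, rng, noise_std=0.01):
    """Deterministic point-mass update plus isotropic Gaussian noise.

    Units are metres, so noise_std=0.01 corresponds to 1 cm. The update
    state + action is a hypothetical stand-in for the real dynamics.
    """
    return tuple(s + a + rng.gauss(0.0, noise_std) for s, a in zip(state, action))

rng = random.Random(0)
next_state = stochastic_step((0.30, 0.50), (0.02, 0.00), rng)
```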

Appendix F Additional Experimental Details

F.1 Network Architectures

For all methods except AIRL, qψ(m|τ) and πω(a|s,m) are represented as 2-layer fully-connected neural networks with 128 and 64 hidden units respectively and ReLU as the activation function.

Following Fu et al. [2017], to alleviate the reward ambiguity problem, we represent the reward function with two components (a context-dependent disentangled reward estimator rθ(s,m) and a context-dependent potential function hϕ(s,m)):

$$
f_{\theta,\phi}(s_t, a_t, s_{t+1}, m) = r_\theta(s_t, m) + \gamma h_\phi(s_{t+1}, m) - h_\phi(s_t, m)
$$

Here rθ(s,m) and hϕ(s,m) are realized as 2-layer fully-connected neural networks with 32 hidden units.
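
To illustrate why a potential function is a natural second component, the sketch below assumes the AIRL-style decomposition f = r(s, m) + γ·h(s', m) − h(s, m) of Fu et al. and uses hand-written stand-ins for the two networks (the functions `r` and `h`, the trajectory, and γ = 1 are all hypothetical). With γ = 1, the potential terms telescope along a trajectory, so h only contributes a boundary term and the reward estimator carries the task-relevant signal.

```python
def shaped_reward(r, h, s, s_next, m, gamma):
    """f = r(s, m) + gamma * h(s_next, m) - h(s, m)."""
    return r(s, m) + gamma * h(s_next, m) - h(s, m)

# Hypothetical stand-ins for the learned networks r_theta and h_phi.
r = lambda s, m: -abs(s - m)   # disentangled reward: negative distance to the goal m
h = lambda s, m: 0.5 * s       # arbitrary potential function

states = [0.0, 1.0, 2.0, 3.0]  # a toy 1-D trajectory
m = 3.0                        # context: goal position

total_f = sum(shaped_reward(r, h, s, s2, m, gamma=1.0)
              for s, s2 in zip(states, states[1:]))
total_r = sum(r(s, m) for s in states[:-1])
boundary = h(states[-1], m) - h(states[0], m)
# total_f equals total_r + boundary: the shaping terms cancel pairwise.
```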

F.2 Environment Details

Point-Maze. The ground-truth reward corresponds to the negative distance to the goal position, together with a term that keeps the pointmass from moving too fast. We use 100 meta-training tasks and 30 meta-test tasks.

Ant. The ground-truth reward corresponds to moving as far as possible forward or backward without being flipped. We have 2 tasks in this domain.

Sweeper. The ground-truth reward is the negative distance from the sweeper to the object plus the negative distance from the object to the goal position. We train all methods on 100 meta-training tasks and test them on 30 meta-test tasks.

Sawyer Pushing. The ground-truth reward in this domain is similar to Sweeper, and we also use 100 meta-training tasks and 30 meta-test tasks.

F.3 Training Details

Training the policy. When training with TRPO, we use an entropy regularization coefficient of 1.0 for Point-Maze and 0.1 for the other three domains. We find that policy training is accelerated by adding to PEMIRL an imitation objective, with scaling factor 0.01, that maximizes the log-likelihood of the sampled expert trajectory conditioned on the latent context variable inferred by qψ.

Training the inference network and the reward model. We train qψ(m|τ), rθ(s,m) and hϕ(s,m) using the Adam optimizer with default hyperparameters.

Scaling up the mutual information regularization. Note that in Equation 10, β does not necessarily need to be equal to 1. Adjusting β is equivalent to scaling $\mathcal{L}_{\text{info}}(\theta,\psi)$. We scale $\mathcal{L}_{\text{info}}(\theta,\psi)$ by 0.1 for all of our experiments.

Policy and inference network initialization. We initialize and qψ(m|τ) using Meta-IL discussed in Section 5 while randomly initializing the policy πω(a|s,m).

Stabilizing adversarial training. As in Fu et al. [2017], we mix policy samples generated from the previous 20 training iterations and use them as negatives when training the discriminator. We find that this strategy prevents the discriminator from overfitting to samples from the current iteration.
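
A minimal sketch of such a sample-mixing buffer is given below. The class name, the uniform sampling over the pooled rollouts, and the toy trajectories are assumptions made for illustration; only the window of 20 iterations comes from the text.

```python
from collections import deque
import random

class NegativeSampleBuffer:
    """Keeps policy rollouts from the most recent `history` training iterations."""

    def __init__(self, history=20):
        # deque with maxlen silently drops the oldest iteration when full.
        self.batches = deque(maxlen=history)

    def add_iteration(self, trajectories):
        self.batches.append(list(trajectories))

    def pool(self):
        """All stored rollouts, flattened across iterations."""
        return [traj for batch in self.batches for traj in batch]

    def sample(self, n, rng):
        """Draw up to n negatives uniformly, without replacement, from the pool."""
        pool = self.pool()
        return rng.sample(pool, min(n, len(pool)))

# Toy usage: 25 iterations, 3 placeholder "trajectories" per iteration.
buffer = NegativeSampleBuffer(history=20)
for it in range(25):
    buffer.add_iteration([(it, k) for k in range(3)])
negatives = buffer.sample(8, random.Random(0))
```

After 25 iterations only rollouts from iterations 5 through 24 remain, so the discriminator never sees negatives drawn exclusively from the current policy.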

F.4 Data Efficiency

During meta-training, for the Point-Maze environment, PEMIRL takes about 32M simulation steps to converge (similar to other methods such as Meta-InfoGAIL, which takes 28M), which amounts to about 2 hours on one Nvidia Titan-Xp GPU; for the Ant environment, it takes about 13.8M simulation steps (Meta-InfoGAIL takes 12M) and about 40 hours on the same hardware (the state-action dimension is much larger than that of Point-Maze).

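At the meta-testing phase, the data efficiency of PEMIRL is comparable to that of RL training with the oracle ground-truth reward, as shown in Table 6: PEMIRL needs about 1.35× the environment steps on Point-Maze-Shift and 1.2× on Disabled-Ant.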

Method                 Point-Maze-Shift    Disabled-Ant
RL w/ oracle reward    4M env steps        15M env steps
PEMIRL                 5.4M env steps      18M env steps
Table 6: Comparison of data efficiency between RL trained with the reward learned by PEMIRL and RL trained with the oracle reward. The two have similar data efficiency on Point-Maze-Shift and Disabled-Ant.