Meta-Inverse Reinforcement Learning with Probabilistic Context Variables
Abstract
Providing a suitable reward function to reinforcement learning can be difficult in many real-world applications. While inverse reinforcement learning (IRL) holds promise for automatically learning reward functions from demonstrations, several major challenges remain. First, existing IRL methods learn reward functions from scratch, requiring large numbers of demonstrations to correctly infer the reward for each task the agent may need to perform. Second, existing methods typically assume homogeneous demonstrations for a single behavior or task, while in practice, it might be easier to collect datasets of heterogeneous but related behaviors. To this end, we propose a deep latent variable model that is capable of learning rewards from demonstrations of distinct but related tasks in an unsupervised way. Critically, our model can infer rewards for new, structurally-similar tasks from a single demonstration. Our experiments on multiple continuous control tasks demonstrate the effectiveness of our approach compared to state-of-the-art imitation and inverse reinforcement learning methods.
Lantao Yu^{*}, Tianhe Yu^{*}, Chelsea Finn, Stefano Ermon
Department of Computer Science, Stanford University, Stanford, CA 94305
{lantaoyu,tianheyu,cbfinn,ermon}@cs.stanford.edu
^{*}Equal contribution.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
1 Introduction
While reinforcement learning (RL) has been successfully applied to a range of decision-making and control tasks in the real world, it relies on a key assumption: having access to a well-defined reward function that measures progress towards the completion of the task. Although it can be straightforward to provide a high-level description of success conditions for a task, existing RL algorithms usually require a more informative signal to expedite exploration and learn complex behaviors in a reasonable time. While reward functions can be hand-specified, reward engineering can require significant human effort. Moreover, for many real-world tasks, it can be challenging to manually design reward functions that actually benefit RL training, and reward misspecification can hamper autonomous learning Amodei et al. (2016).
Learning from demonstrations Schaal et al. (2003) sidesteps the reward specification problem by instead learning directly from expert demonstrations, which can be obtained through teleoperation Zhang et al. (2017) or from human experts Yu et al. (2018b). Demonstrations can often be easier to provide than rewards, as humans can complete many real-world tasks quite efficiently. Two major methodologies of learning from demonstrations are imitation learning and inverse reinforcement learning. Imitation learning is simple and often exhibits good performance Zhang et al. (2017); Ho and Ermon (2016). However, it lacks the ability to transfer learned policies to new settings where the task specification remains the same but the underlying environment dynamics change. As the reward function is often considered the most succinct, robust and transferable representation of a task (Abbeel and Ng, 2004; Fu et al., 2017), the problem of inferring reward functions from expert demonstrations, i.e. inverse RL (IRL) Ng and Russell (2000), is important to consider.
While appealing, IRL still typically relies on large amounts of high-quality expert data, and it can be prohibitively expensive to collect demonstrations that cover all kinds of variations in the wild (e.g. opening all kinds of doors or navigating to all possible target positions). As a result, these methods are data-inefficient, particularly when learning rewards for individual tasks in isolation, starting from scratch. On the other hand, meta-learning Schmidhuber (1987); Bengio et al. (1991), also known as learning to learn, seeks to exploit the structural similarity among a distribution of tasks and optimizes for rapid adaptation to unknown settings with a limited amount of data. As the reward function is able to succinctly capture the structure of a reinforcement learning task, e.g. the goal to achieve, it is promising to develop methods that can quickly infer the structure of a new task, i.e. its reward, and train a policy to adapt to it. Xu et al. (2018) and Gleave and Habryka (2018) have proposed approaches that combine IRL and gradient-based meta-learning Finn et al. (2017a), which provide promising results on deriving generalizable reward functions. However, they have been limited to tabular MDPs Xu et al. (2018) or settings with provided task distributions Gleave and Habryka (2018), which are challenging to gather in real-world applications.
The primary contribution of this paper is a new framework, termed Probabilistic Embeddings for Meta-Inverse Reinforcement Learning (PEMIRL), which enables meta-learning of rewards from unstructured multi-task demonstrations. In particular, PEMIRL integrates ideas from context-based meta-learning Duan et al. (2016); Rakelly et al. (2019), deep latent variable generative models Kingma and Welling (2013), and maximum entropy inverse RL (Ziebart et al., 2008; Ziebart, 2010), into a unified graphical model (see Figure 4 in Appendix D) that bridges the gap between few-shot reward inference and learning from unstructured, heterogeneous demonstrations. PEMIRL can learn robust reward functions that generalize to new tasks with a single demonstration on complex domains with continuous state-action spaces, while meta-training on a set of unstructured demonstrations without specified task groupings or labeling for each demonstration. Our experimental results on various continuous control tasks including Point-Maze, Ant, Sweeper, and Sawyer Pusher demonstrate the effectiveness and scalability of our method.
2 Preliminaries
Markov Decision Process (MDP). A discrete-time finite-horizon MDP is defined by a tuple $(T,\mathcal{S},\mathcal{A},P,r,\eta)$, where $T$ is the time horizon; $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$ describes the (stochastic) transition process between states; $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is a bounded reward function; and $\eta\in\mathcal{P}(\mathcal{S})$ specifies the initial state distribution, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over the state space $\mathcal{S}$. We use $\tau$ to denote a trajectory, i.e. a sequence of state-action pairs for one episode. We also use $\rho_{\pi}(s_t)$ and $\rho_{\pi}(s_t,a_t)$ to denote the state and state-action marginal distributions encountered when executing a policy $\pi(a_t|s_t)$.
Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL). The maximum entropy reinforcement learning (MaxEnt RL) objective is defined as:
$$\underset{\pi}{\max}\sum_{t=1}^{T}\mathbb{E}_{(s_t,a_t)\sim\rho_{\pi}}\left[r(s_t,a_t)+\alpha\mathscr{H}(\pi(\cdot|s_t))\right]$$  (1)
which augments the reward function with a causal entropy regularization term $\mathscr{H}(\pi)=\mathbb{E}_{\pi}[-\log\pi(a|s)]$. Here $\alpha$ is an optional parameter to control the relative importance of reward and entropy. For notational simplicity, without loss of generality, in the following we will assume $\alpha=1$. Given some expert policy $\pi_E$ that is obtained by the above MaxEnt RL procedure, the MaxEnt IRL framework (Ziebart et al., 2008) aims to find a reward function that rationalizes the expert behaviors, which can be interpreted as solving the following maximum likelihood estimation (MLE) problem:
$$p_{\theta}(\tau)\propto\left[\eta(s_1)\prod_{t=1}^{T}P(s_{t+1}|s_t,a_t)\right]\exp\left(\sum_{t=1}^{T}r_{\theta}(s_t,a_t)\right)=\overline{p_{\theta}}(\tau)$$  (2)
$$\underset{\theta}{\arg\min}\,D_{\mathrm{KL}}(p_{\pi_E}(\tau)\,\|\,p_{\theta}(\tau))=\underset{\theta}{\arg\max}\,\mathbb{E}_{p_{\pi_E}(\tau)}\left[\log p_{\theta}(\tau)\right]=\underset{\theta}{\arg\max}\,\mathbb{E}_{\tau\sim\pi_E}\left[\sum_{t=1}^{T}r_{\theta}(s_t,a_t)\right]-\log Z_{\theta}$$
Here, $\theta$ is the parameter of the reward function and $Z_{\theta}$ is the partition function, i.e. $\int\overline{p_{\theta}}(\tau)\mathit{d}\tau$, an integral over all possible trajectories consistent with the environment dynamics. $Z_{\theta}$ is intractable to compute when state-action spaces are large or continuous, or environment dynamics are unknown.
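The step from the log-likelihood to the reward-minus-partition-function form in Equation (2) can be spelled out as follows. This is a standard expansion under the definitions above, given here as a sketch rather than quoted from the original derivation:

$$\log p_{\theta}(\tau)=\underbrace{\log\eta(s_1)+\sum_{t=1}^{T}\log P(s_{t+1}|s_t,a_t)}_{\text{independent of }\theta}+\sum_{t=1}^{T}r_{\theta}(s_t,a_t)-\log Z_{\theta}$$

Dropping the $\theta$-independent terms leaves the objective $\mathbb{E}_{\tau\sim\pi_E}\left[\sum_{t}r_{\theta}(s_t,a_t)\right]-\log Z_{\theta}$, whose gradient is

$$\mathbb{E}_{\tau\sim\pi_E}\left[\sum_{t=1}^{T}\nabla_{\theta}r_{\theta}(s_t,a_t)\right]-\mathbb{E}_{\tau\sim p_{\theta}}\left[\sum_{t=1}^{T}\nabla_{\theta}r_{\theta}(s_t,a_t)\right]$$

The second expectation is taken under $p_{\theta}$ itself, which is why an intractable $Z_{\theta}$ translates into the need for a sampling-based approximation.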
Finn et al. (2016) and Fu et al. (2017) proposed the adversarial IRL (AIRL) framework as an efficient sampling-based approximation to MaxEnt IRL, which resembles Generative Adversarial Networks (Goodfellow et al., 2014). Specifically, in AIRL, there is a discriminator $D_{\theta}$ (a binary classifier) parametrized by $\theta$ and an adaptive sampler $\pi_{\omega}$ (a policy) parametrized by $\omega$. The discriminator takes a particular form: $D_{\theta}(s,a)=\exp(f_{\theta}(s,a))/(\exp(f_{\theta}(s,a))+\pi_{\omega}(a|s))$, where $f_{\theta}(s,a)$ is the learned reward function and $\pi_{\omega}(a|s)$ is precomputed as an input to the discriminator. The discriminator is trained to distinguish between the trajectories sampled from the expert and the adaptive sampler, while the adaptive sampler $\pi_{\omega}(a|s)$ is trained to maximize $\mathbb{E}_{\rho_{\pi_{\omega}}}[\log D_{\theta}(s,a)-\log(1-D_{\theta}(s,a))]$, which is equivalent to maximizing the following entropy-regularized policy objective (with $f_{\theta}(s,a)$ serving as the reward function):
$\mathbb{E}_{\pi_{\omega}}\left[\sum_{t=1}^{T}\log(D_{\theta}(s_t,a_t))-\log(1-D_{\theta}(s_t,a_t))\right]=\mathbb{E}_{\pi_{\omega}}\left[\sum_{t=1}^{T}f_{\theta}(s_t,a_t)-\log\pi_{\omega}(a_t|s_t)\right]$  (3)
Under certain conditions, it can be shown that the learned reward function will recover the ground-truth reward up to a constant (Theorem C.1 in Fu et al. (2017)).
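The identity behind Equation (3) follows directly from the discriminator's form and can be checked numerically. The sketch below is illustrative only: the function names and the sample values are assumptions, not part of any AIRL implementation, and the computation is done in log space for numerical stability.

```python
import numpy as np


def airl_log_probs(f, log_pi):
    """Return (log D, log(1 - D)) for D = exp(f) / (exp(f) + pi).

    Working in log space avoids overflow when f is large.
    """
    log_denom = np.logaddexp(f, log_pi)  # log(exp(f) + pi)
    return f - log_denom, log_pi - log_denom


def airl_policy_reward(f, log_pi):
    """Reward signal for the adaptive sampler: log D - log(1 - D)."""
    log_d, log_one_minus_d = airl_log_probs(f, log_pi)
    return log_d - log_one_minus_d


# Hypothetical values of f_theta(s, a) and log pi_omega(a|s) for three samples.
f_values = np.array([0.3, -1.2, 4.0])
log_pi_values = np.log(np.array([0.5, 0.1, 0.9]))
reward = airl_policy_reward(f_values, log_pi_values)
```

For every sample the result equals `f_values - log_pi_values`, which is the per-step integrand on the right-hand side of Equation (3).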
3 Probabilistic Embeddings for Meta-Inverse Reinforcement Learning
3.1 Problem Statement
Before defining our meta-inverse reinforcement learning problem (Meta-IRL), we first define the concept of an optimal context-conditional policy.
We start by generalizing the notion of MDP with a probabilistic context variable denoted as $m\in \mathcal{M}$, where $\mathcal{M}$ is the (discrete or continuous) value space of $m$. For example, in a navigation task, the context variables could represent different goal positions in the environment. Now, each component of the MDP has an additional dependency on the context variable $m$. For example, by slightly overloading the notation, the reward function is now defined as $r:\mathcal{S}\times \mathcal{A}\times \mathcal{M}\to \mathbb{R}$. For simplicity, the state space, action space, initial state distribution and transition dynamics are often assumed to be independent of $m$ (Duan et al., 2016; Finn et al., 2017a), which we will follow in this work. Intuitively, different $m$’s correspond to different tasks with shared structures.
Given the above definitions, the context-conditional trajectory distribution induced by a context-conditional policy $\pi:\mathcal{S}\times\mathcal{M}\to\mathcal{P}(\mathcal{A})$ can be written as:
$p_{\pi}(\tau=\{\bm{s}_{1:T},\bm{a}_{1:T}\}|m)=\eta(s_1)\prod_{t=1}^{T}\pi(a_t|s_t,m)P(s_{t+1}|s_t,a_t)$  (4)
Let $p(m)$ denote the prior distribution of the latent context variable (which is a part of the problem definition). With the conditional distribution defined above, the optimal entropy-regularized context-conditional policy is defined as:
$$\pi^{*}=\underset{\pi}{\arg\max}\,\mathbb{E}_{m\sim p(m),(\bm{s}_{1:T},\bm{a}_{1:T})\sim p_{\pi}(\cdot|m)}\left[\sum_{t=1}^{T}r(s_t,a_t,m)-\log\pi(a_t|s_t,m)\right]$$  (5)
Now, let us introduce the problem of Meta-IRL from heterogeneous multi-task demonstration data. Suppose there is some ground-truth reward function $r(s,a,m)$ and a corresponding expert policy $\pi_E(a_t|s_t,m)$ obtained by solving the optimization problem defined in Equation (5). Given a set of demonstrations sampled i.i.d. from the induced marginal distribution $p_{\pi_E}(\tau)=\int_{\mathcal{M}}p(m)p_{\pi_E}(\tau|m)\mathit{d}m$, the goal is to meta-learn an inference model $q(m|\tau)$ and a reward function $f(s,a,m)$, such that given some new demonstration $\tau_E$ generated by sampling $m'\sim p(m),\tau_E\sim p_{\pi_E}(\tau|m')$, with $\widehat{m}$ being inferred as $\widehat{m}\sim q(m|\tau_E)$, the learned reward function $f(s,a,\widehat{m})$ and the ground-truth reward $r(s,a,m')$ will induce the same set of optimal policies (Ng et al., 1999).
Critically, we assume no knowledge of the prior task distribution $p(m)$, the latent context variable $m$ associated with each demonstration, or the transition dynamics $P(s_{t+1}|s_t,a_t)$ during meta-training. Note that the entire supervision comes from the provided unstructured demonstrations, which means we also do not assume further interactions with the experts as in Ross et al. (2011).
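To make the data assumption concrete, the following toy sketch draws demonstrations from a mixture over hidden contexts and discards the context, which is the setting described above. Everything here is hypothetical: the 1-D environment, the uniform goal prior, and the hand-written "expert" (which is not a MaxEnt-optimal policy) are stand-ins chosen only to show the shape of the dataset.

```python
import numpy as np


def sample_unlabeled_demos(n_demos, horizon, rng):
    """Draw demonstrations from a mixture over hidden goals on a 1-D line.

    Each demo is an array of shape (horizon, 2) holding (state, action) pairs.
    The goal m is sampled per demo and then discarded, so the returned
    dataset carries no task labels.
    """
    demos = []
    for _ in range(n_demos):
        goal = rng.uniform(-1.0, 1.0)  # m ~ p(m); never stored
        state = 0.0
        pairs = []
        for _ in range(horizon):
            # Toy "expert": step toward the goal with a capped step size.
            action = float(np.clip(goal - state, -0.2, 0.2))
            pairs.append((state, action))
            state += action  # deterministic dynamics
        demos.append(np.array(pairs))
    return demos


rng = np.random.default_rng(0)
demos = sample_unlabeled_demos(n_demos=5, horizon=10, rng=rng)
```

The learner only ever sees `demos`; neither the sampled goals nor the prior over goals is available to it.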
3.2 Meta-IRL with Mutual Information Regularization over Context Variables
Under the framework of MaxEnt IRL, we first parametrize the context variable inference model $q_{\psi}(m|\tau)$ and the reward function $f_{\theta}(s,a,m)$ (where the input $m$ is inferred by $q_{\psi}$). The induced $\theta$-parametrized trajectory distribution is given by:
$p_{\theta}(\tau=\{\bm{s}_{1:T},\bm{a}_{1:T}\}|m)=\frac{1}{Z(\theta)}\left[\eta(s_1)\prod_{t=1}^{T}P(s_{t+1}|s_t,a_t)\right]\exp\left(\sum_{t=1}^{T}f_{\theta}(s_t,a_t,m)\right)$  (6)
where $Z(\theta)$ is the partition function, i.e., an integral over all possible trajectories. Without further constraints over $m$, directly applying AIRL to learning the reward function (by augmenting each component of AIRL with an additional context variable $m$ inferred by $q_{\psi}$) could simply ignore $m$, which is similar to the case of InfoGAIL (Li et al., 2017). Therefore, some connection between the reward function and the latent context variable $m$ needs to be established. With MaxEnt IRL, a parametrized reward function will induce a trajectory distribution. From the perspective of information theory, the mutual information between the context variable $m$ and the trajectories sampled from the reward-induced distribution will provide an ideal measure for such a connection.
Formally, the mutual information between two random variables $m$ and $\tau$ under joint distribution $p_{\theta}(m,\tau)=p(m)p_{\theta}(\tau|m)$ is given by:
$I_{p_{\theta}}(m;\tau)=\mathbb{E}_{m\sim p(m),\tau\sim p_{\theta}(\tau|m)}[\log p_{\theta}(m|\tau)-\log p(m)]$  (7)
where $p_{\theta}(\tau|m)$ is the conditional distribution (Equation (6)), and $p_{\theta}(m|\tau)$ is the corresponding posterior distribution.
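For intuition on why Equation (7) measures the connection between $m$ and the induced behavior, the quantity can be evaluated exactly when both $m$ and $\tau$ are discrete. The sketch below is a toy illustration with made-up probability tables, not part of the method: when the trajectory distribution ignores $m$ the mutual information is zero, and when trajectories identify $m$ it reaches $\log 2$ for two equally likely contexts.

```python
import numpy as np


def mutual_information(p_m, p_tau_given_m):
    """I(m; tau) for discrete m and tau, following Equation (7).

    p_m[i]              : prior probability of context i
    p_tau_given_m[i, j] : probability of trajectory j given context i
    """
    joint = p_m[:, None] * p_tau_given_m      # p(m, tau)
    p_tau = joint.sum(axis=0, keepdims=True)  # marginal over contexts
    mask = joint > 0                          # skip zero-probability pairs
    log_posterior = np.log(joint[mask]) - np.log(np.broadcast_to(p_tau, joint.shape)[mask])
    log_prior = np.log(np.broadcast_to(p_m[:, None], joint.shape)[mask])
    return float(np.sum(joint[mask] * (log_posterior - log_prior)))


p_m = np.array([0.5, 0.5])
ignores_context = np.array([[0.3, 0.7], [0.3, 0.7]])  # same behavior for every m
uses_context = np.array([[1.0, 0.0], [0.0, 1.0]])     # behavior identifies m
mi_ignored = mutual_information(p_m, ignores_context)
mi_used = mutual_information(p_m, uses_context)
```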
As we do not have access to the prior distribution $p(m)$ and posterior distribution $p_{\theta}(m|\tau)$, directly optimizing the mutual information in Equation (7) is intractable. Fortunately, we can leverage $q_{\psi}(m|\tau)$ as a variational approximation to $p_{\theta}(m|\tau)$ to reason about the uncertainty over tasks, as well as conduct approximate sampling from $p(m)$ (we will elaborate on this later in Section 3.3). Formally, letting $p_{\pi_E}(\tau)$ denote the expert trajectory distribution, we have the following desiderata:
Desideratum 1. Matching conditional distributions: $\mathbb{E}_{p(m)}\left[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m)\,\|\,p_{\theta}(\tau|m))\right]=0$
Desideratum 2. Matching posterior distributions: $\mathbb{E}_{p_{\theta}(\tau)}[D_{\mathrm{KL}}(p_{\theta}(m|\tau)\,\|\,q_{\psi}(m|\tau))]=0$
The first desideratum will encourage the $\theta$-induced conditional trajectory distribution to match the empirical distribution implicitly defined by the expert demonstrations, which is equivalent to the MLE objective in the MaxEnt IRL framework. Note that they also share the same marginal distribution over the context variable $p(m)$, which implies that matching the conditionals in Desideratum 1 will also encourage the joint distributions, conditional distributions $p_{\pi_E}(m|\tau)$ and $p_{\theta}(m|\tau)$, and marginal distributions over $\tau$ to be matched. The second desideratum will encourage the variational posterior $q_{\psi}(m|\tau)$ to be a good approximation to $p_{\theta}(m|\tau)$ such that $q_{\psi}(m|\tau)$ can correctly infer the latent context variable given a new expert demonstration sampled from a new task.
With the mutual information (Equation (7)) being the objective, and Desiderata 1 and 2 being the constraints, the meta-inverse reinforcement learning with probabilistic context variables problem can be interpreted as a constrained optimization problem, whose Lagrangian dual function is given by:
$\underset{\theta,\psi}{\min}\,-I_{p_{\theta}}(m;\tau)+\alpha\cdot\mathbb{E}_{p(m)}\left[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m)\,\|\,p_{\theta}(\tau|m))\right]+\beta\cdot\mathbb{E}_{p_{\theta}(\tau)}[D_{\mathrm{KL}}(p_{\theta}(m|\tau)\,\|\,q_{\psi}(m|\tau))]$  (8)
With the Lagrangian multipliers taking specific values ($\alpha =1,\beta =1$) (Zhao et al., 2018), the above Lagrangian dual function can be rewritten as:
$\underset{\theta,\psi}{\min}\,\mathbb{E}_{p(m)}\left[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m)\,\|\,p_{\theta}(\tau|m))\right]+\mathbb{E}_{p_{\theta}(m,\tau)}\left[\log\frac{p(m)}{p_{\theta}(m|\tau)}+\log\frac{p_{\theta}(m|\tau)}{q_{\psi}(m|\tau)}\right]$
$\equiv$  $\underset{\theta,\psi}{\max}\,-\mathbb{E}_{p(m)}\left[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m)\,\|\,p_{\theta}(\tau|m))\right]+\mathbb{E}_{m\sim p(m),\tau\sim p_{\theta}(\tau|m)}[\log q_{\psi}(m|\tau)]$  (9)
$=$  $\underset{\theta,\psi}{\max}\,-\mathbb{E}_{p(m)}\left[D_{\mathrm{KL}}(p_{\pi_E}(\tau|m)\,\|\,p_{\theta}(\tau|m))\right]+\mathcal{L}_{\text{info}}(\theta,\psi)$  (10)
Here the negative entropy term $-H_{p}(m)=\mathbb{E}_{p_{\theta}(m,\tau)}[\log p(m)]=\mathbb{E}_{p(m)}[\log p(m)]$ is omitted (in Eq. (9)) as it can be treated as a constant in the optimization procedure of parameters $\theta$ and $\psi$.
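The cancellation that takes Equation (8) with $\alpha=\beta=1$ to Equation (9) can be written out term by term. The following is a sketch of that intermediate step, using only the definitions of mutual information and KL divergence given above:

$$\begin{aligned}-I_{p_{\theta}}(m;\tau)+\mathbb{E}_{p_{\theta}(\tau)}\left[D_{\mathrm{KL}}(p_{\theta}(m|\tau)\,\|\,q_{\psi}(m|\tau))\right]&=\mathbb{E}_{p_{\theta}(m,\tau)}\left[\log p(m)-\log p_{\theta}(m|\tau)\right]+\mathbb{E}_{p_{\theta}(m,\tau)}\left[\log p_{\theta}(m|\tau)-\log q_{\psi}(m|\tau)\right]\\&=\mathbb{E}_{p(m)}\left[\log p(m)\right]-\mathbb{E}_{m\sim p(m),\tau\sim p_{\theta}(\tau|m)}\left[\log q_{\psi}(m|\tau)\right]\\&=-H_{p}(m)-\mathcal{L}_{\text{info}}(\theta,\psi)\end{aligned}$$

The intractable posterior $p_{\theta}(m|\tau)$ cancels, leaving only the variational term. Because the KL divergence is non-negative, the same identity also shows that $H_{p}(m)+\mathcal{L}_{\text{info}}(\theta,\psi)\le I_{p_{\theta}}(m;\tau)$, so maximizing $\mathcal{L}_{\text{info}}$ maximizes a lower bound on the mutual information.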
3.3 Achieving Tractability with Sampling-Based Gradient Estimation
Note that Equation (10) cannot be evaluated directly, as the first term requires estimating the KL divergence between the empirical expert distribution and the energy-based trajectory distribution $p_{\theta}(\tau|m)$ (induced by the $\theta$-parametrized reward function), and the second term requires sampling from it. For the purpose of optimizing the first term in Equation (10), as introduced in Section 2, we can employ the adversarial reward learning framework (Fu et al., 2017) to construct an efficient sampling-based approximation to the maximum likelihood objective. Note that, unlike in the original AIRL framework, the adaptive sampler $\pi_{\omega}(a|s,m)$ is now additionally conditioned on the context variable $m$. Furthermore, we here introduce the following lemma, which will be helpful for deriving the optimization of the second term in Equation (10).
Lemma 1.
In context variable augmented Adversarial IRL (with the adaptive sampler being $\pi_{\omega}(a|s,m)$ and the discriminator being $D_{\theta}(s,a,m)=\frac{\exp(f_{\theta}(s,a,m))}{\exp(f_{\theta}(s,a,m))+\pi_{\omega}(a|s,m)}$), under deterministic dynamics, when training the adaptive sampler $\pi_{\omega}$ with reward signal $(\log D_{\theta}-\log(1-D_{\theta}))$ to optimality, the trajectory distribution induced by $\pi_{\omega}^{*}$ corresponds to the maximum entropy trajectory distribution with $f_{\theta}(s,a,m)$ serving as the reward function:
$$p_{\pi_{\omega}^{*}}(\tau|m)=\frac{1}{Z_{\theta}}\left[\eta(s_1)\prod_{t=1}^{T}P(s_{t+1}|s_t,a_t)\right]\exp\left(\sum_{t=1}^{T}f_{\theta}(s_t,a_t,m)\right)=p_{\theta}(\tau|m)$$
Proof.
See Appendix A. ∎
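The claim of Lemma 1 can be checked numerically on a small tabular example. The sketch below is not the proof in Appendix A: it assumes a hypothetical two-state, two-action MDP with deterministic dynamics and a fixed context (so the reward table plays the role of $f_{\theta}(\cdot,\cdot,m)$), computes the entropy-regularized optimal policy by backward soft value iteration, and compares the trajectory distribution it induces with the energy-based distribution $\exp(\text{return})/Z_{\theta}$.

```python
import itertools
import numpy as np


def soft_optimal_policy(reward, next_state, horizon):
    """Backward soft value iteration (alpha = 1), deterministic dynamics.

    reward[s, a] is the reward, next_state[s, a] the successor state.
    Returns pi with pi[t, s, a] = probability of action a in state s at step t.
    """
    n_states, n_actions = reward.shape
    value = np.zeros(n_states)  # value after the final step
    pi = np.zeros((horizon, n_states, n_actions))
    for t in reversed(range(horizon)):
        q = reward + value[next_state]          # Q_t(s, a)
        value = np.logaddexp.reduce(q, axis=1)  # V_t(s) = log sum_a exp Q_t(s, a)
        pi[t] = np.exp(q - value[:, None])
    return pi


def trajectory_distributions(reward, next_state, start, horizon):
    """Enumerate all action sequences; return (policy-induced, energy-based) probabilities."""
    pi = soft_optimal_policy(reward, next_state, horizon)
    policy_probs, returns = [], []
    for actions in itertools.product(range(reward.shape[1]), repeat=horizon):
        state, prob, ret = start, 1.0, 0.0
        for t, action in enumerate(actions):
            prob *= pi[t, state, action]
            ret += reward[state, action]
            state = next_state[state, action]
        policy_probs.append(prob)
        returns.append(ret)
    returns = np.array(returns)
    energy_probs = np.exp(returns - np.logaddexp.reduce(returns))  # exp(return) / Z
    return np.array(policy_probs), energy_probs


# Hypothetical toy MDP: 2 states, 2 actions, fixed start state 0.
reward = np.array([[0.0, 1.0], [2.0, 0.0]])
next_state = np.array([[0, 1], [0, 1]])
policy_probs, energy_probs = trajectory_distributions(reward, next_state, start=0, horizon=3)
```

The two arrays agree up to floating-point error, as the lemma predicts for deterministic dynamics. Note that the enumeration visits $|\mathcal{A}|^{T}$ action sequences, which is exactly the cost that makes $Z_{\theta}$ impractical outside toy problems.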
Now we are ready to introduce how to approximately optimize the second term of the objective in Equation (10) w.r.t. $\theta $ and $\psi $. First, we observe that the gradient of ${\mathcal{L}}_{\text{info}}(\theta ,\psi )$ w.r.t. $\psi $ is given by:
$\frac{\partial}{\partial\psi}\mathcal{L}_{\text{info}}(\theta,\psi)=\mathbb{E}_{m\sim p(m),\tau\sim p_{\theta}(\tau|m)}\frac{1}{q(m|\tau,\psi)}\frac{\partial q(m|\tau,\psi)}{\partial\psi}$  (11)
Thus to construct an estimate of the gradient in Equation (11), we need to obtain samples from the $\theta$-induced trajectory distribution $p_{\theta}(\tau|m)$. With Lemma 1, we know that when the adaptive sampler $\pi_{\omega}$ in AIRL is trained to optimality, we can use $\pi_{\omega}^{*}$ to construct samples, as the trajectory distribution $p_{\pi_{\omega}^{*}}(\tau|m)$ matches the desired distribution $p_{\theta}(\tau|m)$.
Note that the expectation in Equation (11) is also taken over the prior task distribution $p(m)$. In cases where we have access to the ground-truth prior distribution, we can directly sample $m$ from it and use $p_{\pi_{\omega}^{*}}(\tau|m)$ to construct a gradient estimate. For the most general case, where we do not have access to $p(m)$ but instead have expert demonstrations sampled from $p_{\pi_E}(\tau)$, we use the following generative process:
$$\tau\sim p_{\pi_E}(\tau),\quad m\sim q_{\psi}(m|\tau)$$  (12)
to synthesize latent context variables, which approximates the prior task distribution when $\theta $ and $\psi $ are trained to optimality.
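Since $\frac{1}{q}\frac{\partial q}{\partial\psi}=\frac{\partial\log q}{\partial\psi}$, Equation (11) is an average of score functions over sampled $(m,\tau)$ pairs. The sketch below illustrates this for one assumed choice of inference model: a linear-Gaussian $q_{\psi}(m|\tau)=\mathcal{N}(m;W\phi(\tau),\sigma^{2}I)$ with $\psi=W$ and a fixed trajectory feature vector $\phi(\tau)$. The model form, the feature vectors, and all names are hypothetical and used only because the score then has a closed form.

```python
import numpy as np


def log_q(weights, traj_features, m, sigma=1.0):
    """log N(m; weights @ traj_features, sigma^2 I)."""
    diff = m - weights @ traj_features
    dim = m.shape[0]
    return (-0.5 * diff @ diff / sigma**2
            - dim * np.log(sigma) - 0.5 * dim * np.log(2.0 * np.pi))


def score_wrt_weights(weights, traj_features, m, sigma=1.0):
    """Gradient of log_q with respect to weights (the score function)."""
    diff = m - weights @ traj_features
    return np.outer(diff / sigma**2, traj_features)


def info_grad_psi(weights, feature_batch, m_batch, sigma=1.0):
    """Monte Carlo estimate of Equation (11) from paired samples (m_i, tau_i)."""
    scores = [score_wrt_weights(weights, phi, m, sigma)
              for phi, m in zip(feature_batch, m_batch)]
    return np.mean(scores, axis=0)


# Placeholder batch: in the method, m_i would come from Equation (12) and
# tau_i from the context-conditioned sampler; here both are random stand-ins.
rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 3))
feature_batch = rng.normal(size=(4, 3))
m_batch = rng.normal(size=(4, 2))
grad = info_grad_psi(weights, feature_batch, m_batch)
```

In practice $q_{\psi}$ would be a neural network and the same quantity would be obtained by automatic differentiation of the averaged log-likelihood.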
To optimize $\mathcal{L}_{\text{info}}(\theta,\psi)$ w.r.t. $\theta$, which is an important step in updating the reward function parameters so that the reward encodes the information of the latent context variable, we cannot directly replace $p_{\theta}(\tau|m)$ with $p_{\pi_{\omega}}(\tau|m)$ as we did for Equation (11). The reason is that we can only use the approximation of $p_{\theta}$ to do inference (i.e. computing the value of an expectation). When we want to optimize an expectation ($\mathcal{L}_{\text{info}}(\theta,\psi)$) w.r.t. $\theta$ and the expectation is taken over $p_{\theta}$ itself, we cannot simply replace $p_{\theta}$ with $\pi_{\omega}$ to do the sampling for estimating the expectation. In the following, we discuss how to estimate the gradient of $\mathcal{L}_{\text{info}}(\theta,\psi)$ w.r.t. $\theta$ with empirical samples from $\pi_{\omega}$.
Lemma 2.
The gradient of $\mathcal{L}_{\text{info}}(\theta,\psi)$ w.r.t. $\theta$ can be estimated with:
$\mathbb{E}_{m\sim p(m),\tau\sim p_{\pi_{\omega}^{*}}(\tau|m)}\left[\log q_{\psi}(m|\tau)\left[\sum_{t=1}^{T}\frac{\partial}{\partial\theta}f_{\theta}(s_t,a_t,m)-\mathbb{E}_{\tau'\sim p_{\pi_{\omega}^{*}}(\tau|m)}\sum_{t=1}^{T}\frac{\partial}{\partial\theta}f_{\theta}(s_t',a_t',m)\right]\right]$
When $\omega $ is trained to optimality, the estimation is unbiased.
Proof.
See Appendix B. ∎
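The estimator in Lemma 2 has the form of a covariance between $\log q_{\psi}(m|\tau)$ and the summed reward gradient along the trajectory. A minimal sample-based sketch for a single context $m$ is given below. It assumes the per-trajectory gradient sums have already been computed (for a linear reward $f_{\theta}=\theta^{\top}\phi(s,a,m)$ they are simply $\sum_{t}\phi(s_t,a_t,m)$), and it reuses the same batch for the inner expectation over $\tau'$, which is a practical simplification rather than something stated in the lemma.

```python
import numpy as np


def info_grad_theta(log_q_values, grad_sums):
    """Sample-based version of the Lemma 2 estimator for a single context m.

    log_q_values[i] : log q_psi(m | tau_i) for sampled trajectory tau_i
    grad_sums[i]    : sum_t d f_theta(s_t, a_t, m) / d theta along tau_i
    """
    baseline = grad_sums.mean(axis=0, keepdims=True)  # inner expectation over tau'
    return np.mean(log_q_values[:, None] * (grad_sums - baseline), axis=0)


# Random placeholders standing in for quantities computed from sampled trajectories.
rng = np.random.default_rng(0)
log_q_values = rng.normal(size=6)
grad_sums = rng.normal(size=(6, 3))
grad = info_grad_theta(log_q_values, grad_sums)
```

Two properties follow from the centered form: adding a constant to every `log_q_values[i]` leaves the estimate unchanged, and if $q_{\psi}$ assigns the same likelihood to every sampled trajectory the estimated gradient is zero, meaning the reward receives no signal from a context that does not change behavior.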
With Lemma 2, as before, we can use the generative process in Equation (12) to sample $m$ and use the conditional trajectory distribution $p_{\pi_{\omega}^{*}}(\tau|m)$ to sample trajectories for estimating $\frac{\partial}{\partial\theta}\mathcal{L}_{\text{info}}(\theta,\psi)$. The overall training objective of PEMIRL is:
$\underset{\omega}{\min}\,\underset{\theta,\psi}{\max}\,\mathbb{E}_{\tau_E\sim p_{\pi_E}(\tau),m\sim q_{\psi}(m|\tau_E),(s,a)\sim\rho_{\pi_{\omega}}(s,a|m)}\log(1-D_{\theta}(s,a,m))+$
$\qquad\mathbb{E}_{\tau_E\sim p_{\pi_E}(\tau),m\sim q_{\psi}(m|\tau_E)}\log(D_{\theta}(s,a,m))+\mathcal{L}_{\text{info}}(\theta,\psi)$  (13)
$\text{where }D_{\theta}(s,a,m)=\exp(f_{\theta}(s,a,m))/(\exp(f_{\theta}(s,a,m))+\pi_{\omega}(a|s,m))$
We summarize the meta-training procedure in Algorithm 1 and the meta-test procedure in Appendix C.
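The two discriminator terms of Equation (13) amount to a binary cross-entropy between expert and sampler state-action pairs, with the context $m$ entering through $f_{\theta}$ and $\pi_{\omega}$. The sketch below computes the negated terms as a loss from precomputed values; it is an illustrative fragment with assumed argument names, not the training loop of Algorithm 1, and it omits $\mathcal{L}_{\text{info}}$.

```python
import numpy as np


def discriminator_loss(f_expert, log_pi_expert, f_policy, log_pi_policy):
    """Negated discriminator terms of Equation (13), averaged over samples.

    f_*      : f_theta(s, a, m) on expert / sampler state-action pairs
    log_pi_* : log pi_omega(a | s, m) evaluated on the same pairs
    """
    log_d_expert = f_expert - np.logaddexp(f_expert, log_pi_expert)
    log_one_minus_d_policy = log_pi_policy - np.logaddexp(f_policy, log_pi_policy)
    return -(np.mean(log_d_expert) + np.mean(log_one_minus_d_policy))


# When exp(f) equals pi everywhere, D = 1/2 on every sample.
log_pi = np.log(np.array([0.2, 0.5, 0.9]))
loss_at_half = discriminator_loss(log_pi, log_pi, log_pi, log_pi)
```

At $D_{\theta}=1/2$ the loss equals $2\log 2$, the value of an uninformative discriminator; a discriminator that separates expert from sampler data achieves a lower loss.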
4 Related Work
Inverse reinforcement learning (IRL), first introduced by Ng and Russell (2000), is the problem of learning reward functions directly from expert demonstrations. Prior work tackling IRL includes margin-based methods Abbeel and Ng (2004); Ratliff et al. (2006) and maximum entropy (MaxEnt) methods Ziebart et al. (2008). Margin-based methods suffer from being an underdefined problem, while MaxEnt requires the algorithm to solve the forward RL problem in the inner loop, making it challenging to use in non-tabular settings. Recent works have scaled MaxEnt IRL to large function approximators, such as neural networks, by only partially solving the forward problem in the inner loop, developing an adversarial framework for IRL Finn et al. (2016, 2016); Fu et al. (2017); Peng et al. (2018). Other imitation learning approaches Ho and Ermon (2016); Li et al. (2017); Hausman et al. (2017); Kuefler and Kochenderfer (2018) are also based on the adversarial framework, but they do not recover a reward function. We build upon the ideas in these single-task IRL works. Instead of considering the problem of learning reward functions for a single task, we aim at the problem of inferring a reward that is disentangled from the environment dynamics and can quickly adapt to new tasks from a single demonstration by leveraging prior data.
We base our work on the problem of meta-learning. Prior work has proposed memory-based methods Duan et al. (2016); Santoro et al. (2016); Munkhdalai and Yu (2017); Rakelly et al. (2019) and methods that learn an optimizer and/or a parameter initialization Andrychowicz et al. (2016); Li and Malik (2017); Finn et al. (2017a). We adopt a memory-based meta-learning method similar to Rakelly et al. (2019), which uses a deep latent variable generative model Kingma and Welling (2013) to infer different tasks from demonstrations. While prior multi-task and meta-RL methods Hausman et al. (2018); Rakelly et al. (2019); Sæmundsson et al. (2018) have investigated the effectiveness of applying latent variable generative models to learning task embeddings, we focus on the IRL problem instead. Meta-IRL Xu et al. (2018); Gleave and Habryka (2018) incorporates meta-learning and IRL, showing fast adaptation of the reward functions to unseen tasks. Unlike these approaches, our method is not restricted to discrete tabular settings and does not require access to grouped demonstrations sampled from a task distribution. Meanwhile, one-shot imitation learning Duan et al. (2017); Finn et al. (2017b); Yu et al. (2018b, a) demonstrates impressive results on learning new tasks using a single demonstration; yet, these methods also require paired demonstrations from each task and hence need prior knowledge of the task distribution. More importantly, one-shot imitation learning approaches only recover a policy, and cannot use additional trials to continue to improve, which is possible when a reward function is inferred instead. Several prior approaches for multi-task imitation learning Li et al. (2017); Hausman et al. (2017); Sharma et al. (2018) propose to use unstructured demonstrations without knowing the task distribution, but they neither study quick generalization to new tasks nor provide a reward function.
Our work is thus driven by the goal of extending meta-IRL to address challenging high-dimensional control tasks with the help of an unstructured demonstration dataset.
5 Experiments
In this section, we seek to investigate the following two questions: (1) Can PEMIRL learn a policy with competitive few-shot generalization abilities compared to one-shot imitation learning methods using only unstructured demonstrations? (2) Can PEMIRL efficiently infer robust reward functions of new continuous control tasks where one-shot imitation learning fails to generalize, enabling an agent to continue to improve with more trials?
We evaluate our method on four simulated domains using the MuJoCo physics engine Todorov et al. (2012). To our knowledge, there is no prior work on designing meta-IRL or one-shot imitation learning methods for complex domains with high-dimensional continuous state-action spaces with unstructured demonstrations. Hence, we also designed the following variants of existing state-of-the-art (one-shot) imitation learning and IRL methods so that they can be used as fair comparisons to our method:

• AIRL: The original AIRL algorithm without incorporating latent context variables, trained across all demonstrations.
• Meta-Imitation Learning with Latent Context Variables (Meta-IL): As in (Rakelly et al., 2019), we use the inference model $q_{\psi}(m|\tau)$ to infer the context of a new task from a single demonstrated trajectory, denoted as $\widehat{m}$, and then train the conditional imitation policy $\pi_{\omega}(a|s,\widehat{m})$ using the same demonstration. This approach also resembles Duan et al. (2017).
• Meta-InfoGAIL: Similar to the method above, except that an additional discriminator $D(s,a)$ is introduced to distinguish between expert and sample trajectories, and trained along with the conditional policy using the InfoGAIL Li et al. (2017) objective.
We use trust region policy optimization (TRPO) Schulman et al. (2015) as our policy optimization algorithm across all methods. We collect demonstrations by training experts with TRPO using the ground-truth reward. However, the ground-truth reward is not available to the imitation learning and IRL algorithms. We provide full hyperparameters, architecture information, data efficiency, and experimental setup details in Appendix F. We also include ablation studies on sensitivity to the latent dimension, the importance of the mutual information objective, and the performance on stochastic environments in Appendix E. Full video results are on the anonymous supplementary website (https://sites.google.com/view/pemirl) and our code is open-sourced on GitHub (https://github.com/ermongroup/MetaIRL).
5.1 Policy Performance on Test Tasks
| Method | Point Maze | Ant | Sweeper | Sawyer Pusher |
|---|---|---|---|---|
| Expert | $5.21\pm 0.93$ | $968.80\pm 27.11$ | $50.86\pm 4.75$ | $23.36\pm 2.54$ |
| Random | $51.39\pm 10.31$ | $55.65\pm 18.39$ | $259.71\pm 11.24$ | $106.88\pm 18.06$ |
| AIRL (Fu et al., 2017) | $18.15\pm 3.17$ | $127.61\pm 27.34$ | $152.78\pm 7.39$ | $51.56\pm 8.57$ |
| Meta-IL | $\mathbf{6.68}\pm 1.51$ | $218.53\pm 26.48$ | $89.02\pm 7.06$ | $28.13\pm 4.93$ |
| Meta-InfoGAIL | $7.66\pm 1.85$ | $\mathbf{871.93}\pm 31.28$ | $87.06\pm 6.57$ | $27.56\pm 4.86$ |
| PEMIRL (ours) | $7.37\pm 1.02$ | $846.18\pm 25.81$ | $\mathbf{74.17}\pm 5.31$ | $\mathbf{27.16}\pm 3.11$ |
We answer our first question by showing that our method learns a policy that can adapt to test tasks from a single demonstration, on four continuous control domains:

• Point Maze Navigation: A point mass needs to navigate around a barrier to reach the goal. Different tasks correspond to different goal positions, and the reward function measures the distance between the point mass and the goal position.

• Ant: Similar to Finn et al. (2017a), this locomotion task requires fast adaptation to the walking direction of the ant: the ant needs to learn to move backward or forward depending on the demonstration.

• Sweeper: A robot arm needs to sweep an object to a particular goal position. Fast adaptation in this domain corresponds to different goal locations in the plane.

• Sawyer Pusher: A simulated Sawyer robot is required to push a mug to a variety of goal positions and generalize to unseen goals.

We illustrate the setup for these experimental domains in Figure 1.
We summarize the results in Table 1. PEMIRL achieves imitation performance comparable to Meta-IL and Meta-InfoGAIL, while AIRL cannot handle multi-task scenarios because it does not incorporate latent context variables.
5.2 Reward Adaptation to Challenging Situations
After demonstrating that the policy learned by our method achieves competitive “one-shot” generalization, we now answer the second question by showing that PEMIRL learns a robust reward that can adapt to new and more challenging settings where the imitation learning methods and the original AIRL fail. Specifically, after providing the demonstration of an unseen task to the agent, we change the underlying environment dynamics but keep the same task goal. To succeed in the task under the new dynamics, the agent must correctly infer the underlying goal of the task instead of simply mimicking the demonstration. We show the effectiveness of our reward generalization by training a new policy with TRPO using the learned reward functions on the new task.
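The difference between replaying a demonstrated behavior and re-optimizing a goal-directed reward under new dynamics can be illustrated on a toy grid. The sketch below is not one of the paper's environments: the 5×5 grid, the hand-written negative-distance reward (a stand-in for a learned state-only reward conditioned on the inferred context), and the use of value iteration in place of TRPO are all assumptions made to keep the example small and exact.

```python
H, W = 5, 5
START, GOAL = (4, 2), (0, 2)
ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def make_walls(gap_col):
    """Barrier across row 2 with a single gap at column `gap_col`."""
    return {(2, c) for c in range(W) if c != gap_col}


def step(state, action, walls):
    """Deterministic move; hitting a wall or the border leaves the state unchanged."""
    r, c = state[0] + ACTIONS[action][0], state[1] + ACTIONS[action][1]
    if 0 <= r < H and 0 <= c < W and (r, c) not in walls:
        return (r, c)
    return state


def reward(state):
    """Hypothetical state-only reward: negative Manhattan distance to the inferred goal."""
    return -(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))


def plan(walls, gamma=0.95, iters=500):
    """Value iteration under the given dynamics with the fixed reward; returns a greedy policy."""
    states = [(r, c) for r in range(H) for c in range(W) if (r, c) not in walls]
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: reward(s) + gamma * max(V[step(s, a, walls)] for a in ACTIONS)
             for s in states}
    return lambda s: max(ACTIONS, key=lambda a: V[step(s, a, walls)])


def replay(actions, walls):
    """Blindly execute a recorded action sequence."""
    s = START
    for a in actions:
        s = step(s, a, walls)
    return s


def rollout(policy, walls, max_steps=30):
    """Follow a policy until the goal is reached or the step budget runs out."""
    s = START
    for _ in range(max_steps):
        if s == GOAL:
            break
        s = step(s, policy(s), walls)
    return s


demo_actions = list("LLUUUURR")        # demonstration recorded with the gap on the left
old_walls = make_walls(gap_col=0)
new_walls = make_walls(gap_col=4)      # dynamics change: the gap moves to the right

replayed_old = replay(demo_actions, old_walls)
replayed_new = replay(demo_actions, new_walls)
replanned_new = rollout(plan(new_walls), new_walls)
```

In this toy setting the replayed actions reach the goal only under the original layout, while re-optimizing against the unchanged reward finds the new route, which is the property the reward adaptation experiments test at scale.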
PointMaze Navigation with a Shifted Barrier. Following the setup of Fu et al. (2017), at meta-test time, after showing a demonstration moving towards a new target position, we change the position of the barrier from left to right. Because the agent must reach the target along a path different from what was demonstrated during meta-training, it cannot succeed without correctly inferring the true goal (the target position in the maze) and learning from trial and error. As a result, all direct policy generalization approaches fail, since all the policies still direct the point mass to the right side of the maze. As shown in Figure 2, PEMIRL learns disentangled reward functions that successfully infer the underlying goal of the new task without much reward shaping. Such reward functions enable the RL agent to bypass the right barrier and reach the true goal position. The RL agent trained with the reward learned by AIRL also fails to bypass the barrier and navigate to the target position: because it neither incorporates latent context variables nor treats the demonstrations as multi-modal, AIRL learns an “average” reward and policy across different tasks. We also use the discriminator output of Meta-InfoGAIL as the reward signal and evaluate its adaptation performance. The agent trained with this reward fails to complete the task, since Meta-InfoGAIL does not explicitly optimize for reward learning and its discriminator output becomes an uninformative uniform distribution at convergence.
| | Method | PointMazeShift | DisabledAnt |
|---|---|---|---|
| Policy Generalization | Meta-IL | $28.61\pm 3.71$ | $27.86\pm 10.31$ |
| Policy Generalization | Meta-InfoGAIL | $29.40\pm 3.05$ | $51.08\pm 4.81$ |
| Policy Generalization | PEMIRL | $28.93\pm 3.59$ | $46.77\pm 5.54$ |
| Reward Adaptation | AIRL | $29.07\pm 4.12$ | $76.21\pm 10.35$ |
| Reward Adaptation | Meta-InfoGAIL | $29.72\pm 3.11$ | $38.73\pm 6.41$ |
| Reward Adaptation | PEMIRL (ours) | $\mathbf{9.04}\pm 1.09$ | $\mathbf{152.62}\pm 11.75$ |
| | Expert | $5.37\pm 0.86$ | $331.17\pm 17.82$ |
Disabled Ant Walking. As in Fu et al. (2017), we disable and shorten the two front legs of the ant so that it cannot walk without substantially changing its gait. Similar to PointMazeShift, all imitation policies fail to maneuver the disabled ant in the right direction. As shown in Figure 3, the reward functions learned by PEMIRL encourage the RL policy to orient the ant towards the demonstrated direction and move along that direction using the two healthy legs, which is only possible when the inferred reward corresponds to the true underlying goal and is disentangled from the dynamics. In contrast, the reward learned by the original AIRL and the discriminator output of Meta-InfoGAIL can neither infer the underlying goal of the task nor provide a precise supervision signal, which leads to the unsatisfactory performance of the induced RL policies. Quantitative results are presented in Table 2.
6 Conclusion
In this paper, we propose a new meta-inverse reinforcement learning algorithm, PEMIRL, which efficiently infers robust reward functions that are disentangled from the dynamics and highly correlated with the ground-truth rewards under meta-learning settings. To our knowledge, PEMIRL is the first model-free Meta-IRL algorithm that can achieve this and scale to complex domains with continuous state-action spaces. PEMIRL generalizes to new tasks by performing inference over a latent context variable with a single demonstration, on which the recovered policy and reward function are conditioned. Extensive experimental results demonstrate the scalability and effectiveness of our method against strong baselines.
Acknowledgments
This research was supported by Toyota Research Institute, NSF (#1651565, #1522054, #1733686), ONR (N000141912145), AFOSR (FA9550 1910024). The authors would like to thank Chris Cundy for discussions over the paper draft.
References
Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM, pp. 1.
Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
Learning to learn by gradient descent by gradient descent. arXiv preprint arXiv:1606.04474.
Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks.
One-shot imitation learning. Neural Information Processing Systems (NIPS).
RL$^2$: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning.
A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852.
Guided cost learning: deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58.
One-shot visual imitation learning via meta-learning.
Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248.
Multi-task maximum entropy inverse reinforcement learning. arXiv preprint arXiv:1805.08882.
Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 1235–1245.
Learning an embedding space for transferable robot skills.
Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, pp. 4565–4573.
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Burn-in demonstrations for multi-modal imitation learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’18.
Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909.
Learning to optimize neural nets. arXiv preprint arXiv:1703.00441.
InfoGAIL: interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3812–3822.
Meta networks. International Conference on Machine Learning (ICML).
Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287.
Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00.
Variational discriminator bottleneck: improving imitation learning, inverse RL, and GANs by constraining information flow. arXiv preprint arXiv:1810.00821.
Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254.
Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06.
A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635.
Meta reinforcement learning with latent variable Gaussian processes. arXiv preprint arXiv:1803.07551.
Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML).
Computational approaches to motor learning by imitation. Philosophical Transactions of the Royal Society of London B: Biological Sciences.
Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. Ph.D. Thesis, Technische Universität München.
Trust region policy optimization. International Conference on Machine Learning.
Directed-info GAIL: learning hierarchical policies from unsegmented demonstrations using directed information. arXiv preprint arXiv:1810.01266.
MuJoCo: a physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS).
Learning a prior over intent via meta-inverse reinforcement learning. arXiv preprint arXiv:1805.12573.
One-shot hierarchical imitation learning of compound visuomotor tasks. arXiv preprint arXiv:1810.11043.
One-shot imitation from observing humans via domain-adaptive meta-learning. Robotics: Science and Systems (R:SS).
Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. arXiv preprint arXiv:1710.04615.
The information autoencoding family: a Lagrangian perspective on latent variable generative models. arXiv preprint arXiv:1806.06514.
Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438.
Modeling purposeful adaptive behavior with the principle of maximum causal entropy.
Appendix A Proof of Lemma 1
From Sections 2.3 and 2.4 in [Levine, 2018], we know that the policy whose induced trajectory distribution is Equation (6) takes the following energy-based form:
$$\begin{aligned}
\pi_{\theta}(a_t|s_t,m) &= \exp\left(Q_{\text{soft}}(s_t,a_t,m)-V_{\text{soft}}(s_t,m)\right)\\
Q_{\text{soft}}(s_t,a_t,m) &= f_{\theta}(s_t,a_t,m)+\log \mathbb{E}_{s_{t+1}\sim P(\cdot|s_t,a_t,m)}\left[\exp\left(V_{\text{soft}}(s_{t+1},m)\right)\right]\\
V_{\text{soft}}(s_t,m) &= \log \int_{\mathcal{A}}\exp\left(Q_{\text{soft}}(s_t,a',m)\right)\,da'
\end{aligned}$$
which corresponds to the optimal policy for the following entropy-regularized reinforcement learning problem (for a given value of $m$):
$$\max_{\pi}\ \mathbb{E}_{\pi}\left[\sum_{t=1}^{T} f_{\theta}(s_t,a_t,m)-\log \pi(a_t|s_t,m)\right] \qquad (14)$$
From Section 2, we know that Equation (14) is exactly the training objective for the adaptive sampler $\pi_{\omega}$ in AIRL. Thus, the trajectory distribution of the optimal policy $\pi_{\omega^{*}}$ matches $p_{\theta}(\tau|m)$ defined in Equation (6).
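The relation between $Q_{\text{soft}}$, $V_{\text{soft}}$ and the policy can be checked numerically in a simplified setting. The sketch below assumes a single state with three discrete actions (so the integral over $\mathcal{A}$ becomes a sum) and made-up Q-values; it only illustrates that $\exp(Q_{\text{soft}}-V_{\text{soft}})$ is a normalized distribution and that it attains the entropy-regularized objective value $V_{\text{soft}}$.

```python
import numpy as np


def soft_value(q_values):
    """V_soft(s) = log sum_a exp(Q_soft(s, a)), computed in a numerically stable way."""
    q_max = np.max(q_values)
    return q_max + np.log(np.sum(np.exp(q_values - q_max)))


def soft_policy(q_values):
    """pi(a | s) = exp(Q_soft(s, a) - V_soft(s))."""
    return np.exp(q_values - soft_value(q_values))


def regularized_objective(p, q_values):
    """One-step analogue of the entropy-regularized objective: E_p[Q - log p]."""
    return np.sum(p * (q_values - np.log(p)))


q = np.array([1.0, 2.0, 0.5])          # hypothetical Q_soft(s, .) for one state and context
pi = soft_policy(q)
uniform = np.full(3, 1.0 / 3.0)        # any other policy, for comparison

best = regularized_objective(pi, q)            # equals V_soft(s)
other = regularized_objective(uniform, q)      # strictly smaller than V_soft(s)
```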
Appendix B Proof of Lemma 2
First, the gradient of $\mathcal{L}_{\text{info}}(\theta,\psi)$ w.r.t. $\theta$ can be written as:
$$\frac{\partial}{\partial\theta}\mathcal{L}_{\text{info}}(\theta,\psi)=\mathbb{E}_{m\sim p(m),\,\tau\sim p(\tau|m,\theta)}\left[\log q(m|\tau,\psi)\,\frac{\partial}{\partial\theta}\log p_{\theta}(\tau|m)\right] \qquad (15)$$
As $p_{\theta}(\tau|m)$ is an energy-based distribution (Equation (6)), we need to derive the gradient of $\log p(\tau|m,\theta)$ w.r.t. $\theta$:
$$\begin{aligned}
\frac{\partial}{\partial\theta}\log p(\tau|m,\theta) &= \frac{\partial}{\partial\theta}\left[\log\left(\eta(s_1)\prod_{t=1}^{T}P(s_{t+1}|s_t,a_t)\right)+\sum_{t=1}^{T}f_{\theta}(s_t,a_t,m)-\log Z(\theta)\right] &\qquad (16)\\
&= \sum_{t=1}^{T}\frac{\partial}{\partial\theta}f_{\theta}(s_t,a_t,m)-\frac{\partial}{\partial\theta}\log Z(\theta) &\qquad (17)\\
&= \sum_{t=1}^{T}\frac{\partial}{\partial\theta}f_{\theta}(s_t,a_t,m)-\mathbb{E}_{\tau\sim p(\tau|m,\theta)}\left[\sum_{t=1}^{T}\frac{\partial}{\partial\theta}f_{\theta}(s_t,a_t,m)\right] &\qquad (18)
\end{aligned}$$
Substituting Equation (18) into Equation (15), we get:
$$\mathbb{E}_{m\sim p(m),\,\tau\sim p_{\theta}(\tau|m)}\left[\log q_{\psi}(m|\tau)\left[\sum_{t=1}^{T}\frac{\partial}{\partial\theta}f_{\theta}(s_t,a_t,m)-\mathbb{E}_{\tau'\sim p_{\theta}(\tau|m)}\sum_{t=1}^{T}\frac{\partial}{\partial\theta}f_{\theta}(s_t',a_t',m)\right]\right]$$
With Lemma 1, we know that when $\omega$ is trained to optimality, we can sample from $p_{\pi_{\omega}^{*}}(\tau|m)$ to construct an unbiased gradient estimate.
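The step from Equation (17) to Equation (18) uses the standard identity for the gradient of a log-partition function. A short derivation is sketched below, assuming $Z(\theta)$ is the normalizer implied by Equation (16), i.e. the integral of the unnormalized trajectory density over all trajectories for a fixed $m$; the initial-state and transition terms do not depend on $\theta$, which is also why they vanish in Equation (17).

```latex
\begin{aligned}
Z(\theta) &= \int \eta(s_1)\prod_{t=1}^{T}P(s_{t+1}|s_t,a_t)\,
             \exp\Big(\sum_{t=1}^{T}f_{\theta}(s_t,a_t,m)\Big)\,d\tau \\
\frac{\partial}{\partial\theta}\log Z(\theta)
          &= \frac{1}{Z(\theta)}\int \eta(s_1)\prod_{t=1}^{T}P(s_{t+1}|s_t,a_t)\,
             \exp\Big(\sum_{t=1}^{T}f_{\theta}(s_t,a_t,m)\Big)
             \sum_{t=1}^{T}\frac{\partial}{\partial\theta}f_{\theta}(s_t,a_t,m)\,d\tau \\
          &= \mathbb{E}_{\tau\sim p(\tau|m,\theta)}
             \Big[\sum_{t=1}^{T}\frac{\partial}{\partial\theta}f_{\theta}(s_t,a_t,m)\Big]
\end{aligned}
```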
Appendix C Meta-Testing Procedure of PEMIRL
We summarize the meta-test stage of PEMIRL for adapting reward functions to new tasks in Algorithm 2.
Appendix D Graphical Model of PEMIRL
Here we show the graphical model of the PEMIRL framework in Figure 4.
Appendix E Ablation Studies
In this section, we perform ablation studies on the sensitivity to the latent dimension, the importance of the mutual information loss ($\mathcal{L}_{\text{info}}$) term, and the stochasticity of the environment. We conduct each ablation study on the PointMazeShift environment to evaluate the reward adaptation performance.
Sensitivity of the latent dimension. We first investigate the sensitivity to the latent dimension by running PEMIRL with the latent dimension chosen from $\{1,3,5\}$ on PointMazeShift, where the ground-truth latent dimension is $3$. The results are summarized in Table 5. PEMIRL consistently outperforms the best baseline (return 28.61) under all latent dimension specifications and is hence robust to dimension misspecification.
Importance of $\mathcal{L}_{\text{info}}$. As shown in Table 5, the reward function learned by PEMIRL without the mutual information objective fails to induce a good policy in the reward adaptation setting, which demonstrates the importance of using $\mathcal{L}_{\text{info}}$.
Performance on stochastic environment. We create a stochastic version of PointMazeShift (maze size: $60\times 100$ cm) by changing its deterministic transition dynamics into stochastic ones. Specifically, $p(s_{t+1}|s_t,a_t)$ is now realized as a Gaussian with a standard deviation of 1 cm. As shown in Table 5, the average return of PEMIRL outperforms the best baseline, Meta-IL, by a large margin.
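One way such a stochastic transition could be realized is sketched below. Only the maze size and the 1 cm standard deviation come from the text; treating the state as a 2D position, centering the Gaussian on the deterministic next position, using isotropic noise, and clipping to the maze bounds are assumptions of this example rather than details of the actual environment.

```python
import numpy as np

MAZE_SIZE_CM = np.array([60.0, 100.0])   # maze size stated in the text
NOISE_STD_CM = 1.0                        # standard deviation stated in the text


def stochastic_step(position_cm, displacement_cm, rng):
    """Sample s_{t+1} ~ N(s_t + displacement, NOISE_STD_CM^2 * I), clipped to the maze."""
    mean = position_cm + displacement_cm                      # deterministic next position
    noisy = mean + rng.normal(0.0, NOISE_STD_CM, size=2)      # isotropic Gaussian noise (assumed)
    return np.clip(noisy, 0.0, MAZE_SIZE_CM)                  # keep the point inside the maze


rng = np.random.default_rng(0)
next_position = stochastic_step(np.array([30.0, 50.0]), np.array([1.0, 0.0]), rng)
```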
Appendix F Additional Experimental Details
F.1 Network Architectures
For all methods except AIRL, $q_{\psi}(m|\tau)$ and $\pi_{\omega}(a|s,m)$ are represented as $2$-layer fully-connected neural networks with $128$ and $64$ hidden units respectively and ReLU as the activation function.
Following Fu et al. [2017], to alleviate the reward ambiguity problem, we represent the reward function with two components: a context-dependent disentangled reward estimator $r_{\theta}(s,m)$ and a context-dependent potential function $h_{\varphi}(s,m)$:
$$f_{\theta,\varphi}(s_t,a_t,s_{t+1},m)=r_{\theta}(s_t,m)+\gamma h_{\varphi}(s_{t+1},m)-h_{\varphi}(s_t,m)$$
Here $r_{\theta}(s,m)$ and $h_{\varphi}(s,m)$ are each realized as a $2$-layer fully-connected neural network with $32$ hidden units.
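The decomposition above can be sketched in a few lines of NumPy. The dimensions, random weights, discount factor, and the reading of "2-layer" as one hidden layer of 32 ReLU units are assumptions for illustration; the action argument is omitted because the right-hand side does not depend on it. The final lines check the telescoping property of the shaping term: over a trajectory, the discounted sum of $f$ differs from the discounted sum of $r_{\theta}$ only by boundary values of $h_{\varphi}$.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, LATENT_DIM, HIDDEN = 4, 3, 32   # STATE_DIM and LATENT_DIM are hypothetical


def init_net():
    """One hidden layer of 32 ReLU units mapping (s, m) to a scalar."""
    return {
        "W1": rng.normal(0.0, 0.3, (STATE_DIM + LATENT_DIM, HIDDEN)),
        "b1": np.zeros(HIDDEN),
        "W2": rng.normal(0.0, 0.3, (HIDDEN, 1)),
        "b2": np.zeros(1),
    }


def net(params, s, m):
    """Evaluate the scalar network on the concatenated state and context."""
    x = np.concatenate([s, m])
    hidden = np.maximum(x @ params["W1"] + params["b1"], 0.0)
    return float((hidden @ params["W2"] + params["b2"])[0])


r_theta, h_phi = init_net(), init_net()


def f(s, s_next, m, gamma):
    """f = r_theta(s, m) + gamma * h_phi(s', m) - h_phi(s, m)."""
    return net(r_theta, s, m) + gamma * net(h_phi, s_next, m) - net(h_phi, s, m)


# Telescoping check on a random trajectory s_0, ..., s_T.
gamma = 0.99
states = rng.normal(size=(6, STATE_DIM))
m = rng.normal(size=LATENT_DIM)
T = len(states) - 1
disc_f = sum(gamma**t * f(states[t], states[t + 1], m, gamma) for t in range(T))
disc_r = sum(gamma**t * net(r_theta, states[t], m) for t in range(T))
boundary = gamma**T * net(h_phi, states[T], m) - net(h_phi, states[0], m)
```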
F.2 Environment Details
PointMaze. The ground-truth reward is the negative distance to the goal position plus a term that keeps the point mass from moving too fast. We use $100$ meta-training tasks and $30$ meta-test tasks.
Ant. The ground-truth reward corresponds to moving as far as possible forward or backward without being flipped. We have $2$ tasks in this domain.
Sweeper. The ground-truth reward is the negative distance from the sweeper to the object plus the negative distance from the object to the goal position. We train all methods on $100$ meta-training tasks and test them on $30$ meta-test tasks.
Sawyer Pushing. The ground-truth reward in this domain is similar to Sweeper, and we also use $100$ meta-training tasks and $30$ meta-test tasks.
F.3 Training Details
Training the policy. When training with TRPO, we use an entropy regularizer of $1.0$ for PointMaze and $0.1$ for the other three domains. We find that policy training is accelerated by adding an imitation objective to PEMIRL, with scaling factor $0.01$, that maximizes the log-likelihood of the sampled expert trajectory conditioned on the latent context variable inferred by $q_{\psi}$.
Training the inference network and the reward model. We train $q_{\psi}(m|\tau)$, $r_{\theta}(s,m)$ and $h_{\varphi}(s,m)$ using the Adam optimizer with default hyperparameters.
Scaling up the mutual information regularization. Note that in Equation (10), $\beta$ does not necessarily need to be equal to $1$. Adjusting $\beta$ is equivalent to scaling $\mathcal{L}_{\text{info}}(\theta,\psi)$. We scale $\mathcal{L}_{\text{info}}(\theta,\psi)$ by $0.1$ for all of our experiments.
Policy and inference network initialization. We initialize and $q_{\psi}(m|\tau)$ using Meta-IL discussed in Section 5 while randomly initializing the policy $\pi_{\omega}(a|s,m)$.
Stabilizing adversarial training. As in Fu et al. [2017], we mix policy samples generated from the previous $20$ training iterations and use them as negatives when training the discriminator. We find that this strategy prevents the discriminator from overfitting to samples from the current iteration.
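The sample-mixing strategy can be sketched with a bounded queue of per-iteration rollouts. The class and method names below are hypothetical, and drawing negatives uniformly from the pooled trajectories is an assumption; the text only states that samples from the previous 20 iterations are mixed.

```python
import random
from collections import deque


class PolicySampleBuffer:
    """Keeps policy rollouts from the most recent `num_iterations` training iterations."""

    def __init__(self, num_iterations=20):
        # A deque with maxlen evicts the oldest iteration automatically.
        self._batches = deque(maxlen=num_iterations)

    def add_iteration(self, trajectories):
        """Store all trajectories collected in one training iteration."""
        self._batches.append(list(trajectories))

    def sample_negatives(self, batch_size, rng=random):
        """Draw discriminator negatives uniformly from all stored trajectories (assumed scheme)."""
        pool = [traj for batch in self._batches for traj in batch]
        return rng.sample(pool, min(batch_size, len(pool)))


buffer = PolicySampleBuffer(num_iterations=20)
for iteration in range(25):
    # Placeholder "trajectories": (iteration index, rollout index) pairs.
    buffer.add_iteration([(iteration, k) for k in range(4)])
negatives = buffer.sample_negatives(batch_size=8)
```

After 25 iterations the buffer holds only iterations 5 through 24, so the discriminator never sees negatives older than 20 iterations.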
F.4 Data Efficiency
During meta-training, the PointMaze environment takes about 32M simulation steps to converge (similar to other methods such as Meta-InfoGAIL, which takes 28M), which amounts to about 2 hours on one Nvidia Titan Xp GPU; the Ant environment takes about 13.8M simulation steps (Meta-InfoGAIL takes 12M) and about 40 hours on the same hardware (the state-action dimension is much larger than that of PointMaze).
In the meta-testing phase, the data efficiency of PEMIRL is comparable to that of RL training with the oracle ground-truth reward, as shown in Table 6.
| | PointMazeShift | DisabledAnt |
|---|---|---|
| RL w/ oracle reward | 4M env steps | 15M env steps |
| PEMIRL | 5.4M env steps | 18M env steps |