CWAE-IRL: Formulating a supervised approach to Inverse Reinforcement Learning problem
Abstract
Inverse reinforcement learning (IRL) is used to infer the reward function from the actions of an expert running a Markov Decision Process (MDP). This research proposes a novel approach that uses variational inference to learn the reward function. With this technique, the intractable posterior distribution of the continuous latent variable (here, the reward function) is analytically approximated so that it stays as close as possible to the prior belief while reconstructing the future state conditioned on the current state and action. The reward function is derived using a well-known deep generative model, the Conditional Variational Autoencoder (CVAE), with a Wasserstein loss function; the method is therefore referred to as Conditional Wasserstein Autoencoder-IRL (CWAE-IRL) and can be analyzed as a combination of backward and forward inference. It offers an efficient alternative to previous approaches to IRL while requiring no knowledge of the system dynamics of the agent. Experimental results on standard benchmarks such as objectworld and pendulum show that the proposed algorithm can effectively learn the latent reward function in complex, high-dimensional environments.
1 Introduction
Reinforcement learning, formalized as a Markov decision process (MDP), provides a general solution to sequential decision making: given a state, the agent takes an optimal action by maximizing the long-term reward from the environment [4]. In practice, however, defining a reward function that weighs the features of the state correctly can be challenging, and techniques like reward shaping are often used to solve complex real-world problems [19]. The process of inferring the reward function from demonstrations by an expert is defined as inverse reinforcement learning (IRL) or apprenticeship learning [20, 1].
The fundamental problem with IRL is that it is underdetermined: infinitely many different reward functions can yield the same policy [8]. Previous approaches have used preferences on the reward function to address this non-uniqueness. [20] suggested a reward function that maximizes the difference between the values of the expert’s policy and the second-best policy. [33] adopted the principle of maximum entropy for learning the policy whose feature expectations are constrained to match those of the expert. [22] applied structured max-margin optimization to IRL and proposed a method for finding the reward function that maximizes the margin between the expert’s policy and all other policies. [18] unified a direct method that minimizes deviation from the expert’s behavior and an indirect method that finds an optimal policy from the reward function learned using IRL. [26] used a game-theoretic framework to find a policy that improves on the expert’s.
Another challenge for IRL is that some variant of the forward reinforcement learning problem needs to be solved in a tightly coupled manner to obtain the corresponding policy, which is then compared to the demonstrated actions [8]. Most early IRL algorithms proposed solving an MDP in the inner loop [20, 1, 33]. This requires perfect knowledge of the expert’s dynamics, which is almost always impossible to obtain. Several works have proposed to relax this requirement, for example by learning a value function instead of a cost [27], solving an approximate local control problem [15] or generating a discrete graph of states [7]. However, all these methods still require some partial knowledge of the system dynamics.
Most of the early research in this field expressed the reward function as a weighted linear combination of hand-selected features [20, 21, 33]. Non-parametric methods such as Gaussian Processes (GPs) have also been used for potentially complex, nonlinear reward functions [16]. While in principle this extends the IRL paradigm to flexibly account for nonlinear reward approximation, the use of kernels also leads to higher sample size requirements. Universal function approximators such as nonlinear deep neural networks have been proposed recently [31, 8]. These move away from handcrafted features and help in learning highly nonlinear reward functions, but they still need the agent in the loop to generate new samples that "guide" the cost to the optimal reward function. [10] recently proposed an adversarial reward learning formulation that disentangles the reward learning process using a discriminator trained via binary regression and uses policy gradient algorithms to learn the policy as well.
The Bayesian IRL (BIRL) algorithm proposed by [21] uses the expert’s actions as evidence to update the prior on reward functions. The reward learning and apprenticeship learning steps are solved by performing the inference using a modified Markov Chain Monte Carlo (MCMC) algorithm. [32] described an expectation-maximization (EM) approach for solving the BIRL problem, referring to it as Robust BIRL (RBIRL). Variational Inference (VI) has been used as an efficient alternative to MCMC sampling for approximating posterior densities [12, 30]. The Variational Autoencoder (VAE) was proposed by [13] as a neural network version of the approximate inference model. The loss function of the VAE is constructed so that it maximizes the likelihood of the data given the current latent variables (reconstruction loss) while encouraging the latent variables to stay close to our prior belief of what the variables should look like (Kullback-Leibler divergence loss). This can be seen as a generalization of EM from maximum a posteriori (MAP) estimation of a single parameter to an approximation of the complete posterior distribution. The Conditional VAE (CVAE) was proposed by [24] to develop a deep conditional generative model for structured output prediction using Gaussian latent variables. The Wasserstein AutoEncoder (WAE) was proposed by [28] to use a Wasserstein loss function in place of the KL divergence loss for robustly estimating the loss in the case of small samples, where the VAE fails.
This research is motivated by the observation that IRL can be formulated as a supervised learning problem with latent variable modelling. This intuition is not unique; it was proposed by [14] in the Cascaded Supervised IRL (CSI) approach. However, CSI uses non-generalizable heuristics to classify the dataset and find the decision rule used to estimate the reward function. Here, I propose to use the CVAE framework with a Wasserstein loss function to determine the nonlinear, continuous reward function from the expert trajectories without the need for system dynamics. The encoder step of the CVAE is used to learn the original reward function from the next state conditioned on the current state and action. The decoder step is used to recover the next state given the current state, action and the latent reward function.
The likelihood loss, composed of the reconstruction error and the Wasserstein loss, is then used to optimize the CVAE network. The Gaussian distribution is used here as the prior distribution; however, [21] describes various other prior distributions that can be used depending on the class of problem being solved. Since the states are supplied by the expert’s trajectories, the CWAE-IRL algorithm is run only on those states, without the need to run an MDP or have the agent in the loop. Two novel contributions are made in this paper:

• Proposing a generative model such as an autoencoder for estimating the reward function leads to a more effective and efficient algorithm with a locally optimal, analytically approximate solution.

• Using only the expert’s state-action trajectories provides a robust generative solution without any knowledge of the system dynamics.
Section 2 gives the background on the concepts used to build our model; Section 3 describes the proposed methodology; Section 4 gives the results and Section 5 provides the discussion and conclusions.
2 Preliminaries
2.1 Markov Decision Processes (MDPs)
In the reinforcement learning problem, at time $t$, the agent observes a state $s_{t}\in S$ and takes an action $a_{t}\in A$, thereby receiving an immediate scalar reward $r_{t}$ and moving to a new state $s_{t+1}$. The model’s dynamics are characterized by the state transition probabilities $p(s_{t+1}\mid s_{t},a_{t})$. This can be formally stated as a Markov Decision Process (MDP), where the next state is completely determined by the previous state and action (the Markov property) and the agent receives a scalar reward for executing the action [4].
The goal of the agent is to maximize the cumulative reward (discounted sum of rewards) or value function:
$${v}_{t}=\sum _{k=0}^{\mathrm{\infty}}{\gamma}^{k}{r}_{t+k}$$  (1) 
where $0\le \gamma \le 1$ is the discount factor and ${r}_{t}$ is the reward at timestep $t$.
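As a small illustration of Equation 1, the discounted return of a finite reward sequence can be computed as below. This is a minimal sketch; the function name `discounted_return` is hypothetical and the infinite sum is truncated at the length of the list.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: v_t = r_t + gamma * v_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```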
In terms of a policy $\pi :S\to A$, the value function is given by the Bellman equation as:
$$v_{\pi}(s_{t})=\underset{\pi}{\mathbb{E}}\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}\,\middle|\,S=s_{t}\right]$$  (2)
$$=\underset{\pi}{\mathbb{E}}\left[r_{t}+\sum_{k=1}^{\infty}\gamma^{k}r_{t+k}\,\middle|\,S=s_{t}\right]$$  (3)
$$=\sum_{a_{t}}\pi(a_{t}\mid s_{t})\sum_{s_{t+1}}p(s_{t+1}\mid s_{t},a_{t})\left[r(s_{t},a_{t},s_{t+1})+\gamma\,\underset{\pi}{\mathbb{E}}\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+1+k}\,\middle|\,S=s_{t+1}\right]\right]$$  (4)
$$=\sum_{a_{t}}\pi(a_{t}\mid s_{t})\sum_{s_{t+1}}p(s_{t+1}\mid s_{t},a_{t})\left[r(s_{t},a_{t},s_{t+1})+\gamma\,v_{\pi}(s_{t+1})\right]$$  (5)
Using Bellman’s optimality equation, for any MDP a policy $\pi$ is defined to be better than or equal to any other policy $\pi^{\prime}$ if its value function satisfies $v_{\pi}(s_{t})\ge v_{\pi^{\prime}}(s_{t})$ for all $s_{t}\in S$. Such a policy is known as an optimal policy ($\pi^{*}$) and its value function is known as the optimal value function ($v^{*}$).
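The recursion in Equation 5 can be turned into a simple fixed-point iteration. The sketch below assumes a small tabular MDP with known arrays `P`, `R` and `pi` (all hypothetical names and layouts) and a discount factor strictly below 1 so that the iteration converges.

```python
import numpy as np


def evaluate_policy(P, R, pi, gamma, tol=1e-10):
    """Iterate the Bellman expectation backup until the values stop changing.

    P[s, a, s2] is the transition probability, R[s, a, s2] the reward,
    and pi[s, a] the probability of taking action a in state s.
    """
    v = np.zeros(P.shape[0])
    while True:
        # Expected one-step reward plus discounted value of the next state.
        q = np.einsum("sat,sat->sa", P, R + gamma * v[None, None, :])
        v_new = np.einsum("sa,sa->s", pi, q)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new


# Two states, one action: state 0 moves to state 1 (reward 0),
# state 1 loops on itself (reward 1).
P = np.array([[[0.0, 1.0]], [[0.0, 1.0]]])
R = np.array([[[0.0, 0.0]], [[0.0, 1.0]]])
pi = np.ones((2, 1))
v = evaluate_policy(P, R, pi, gamma=0.9)
```

In this toy chain the values approach $v(1)=1/(1-0.9)=10$ and $v(0)=0.9\cdot 10=9$.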
2.2 Bayesian IRL
The Bayesian approach to IRL was proposed by [21], encoding the reward function preference as a prior and the confidence in the optimality of the behavior data as the likelihood. Considering that the expert $\chi $ is executing an MDP $M=(S,A,p,\gamma )$, the reward for $\chi $ is assumed to be sampled from a known prior distribution ${P}_{R}$ defined as:
$${P}_{R}(R)=\prod _{s\in S,a\in A}P(R(s,a))$$  (6) 
The distribution to be used as a prior depends on the type of problem. The expert’s goal of maximizing accumulated reward is equivalent to finding the optimal action in each state. The likelihood thus defines our confidence in $\chi $’s ability to select the optimal action. This is modeled as an exponential distribution for the likelihood of trajectory $\tau $ with ${Q}^{*}$ as:
$$\mathrm{Pr}_{\chi}(\tau \mid R)=\frac{1}{Z^{\prime}}\exp\left(\alpha_{\chi}Q^{*}(\tau ,R)\right)$$  (7) 
where ${\alpha}_{\chi}$ is a parameter representing the degree of confidence in $\chi $’s ability. The posterior probability of the reward function $R$ is computed using Bayes’ theorem,
$$\mathrm{Pr}_{\chi}(R\mid \tau )=\frac{\mathrm{Pr}_{\chi}(\tau \mid R)\,P_{R}(R)}{\mathrm{Pr}(\tau )}$$  (8)
$$=\frac{1}{Z}\exp\left(\alpha_{\chi}Q^{*}(\tau ,R)\right)P_{R}(R)$$  (9) 
BIRL uses MCMC sampling to compute the posterior mean of the reward function.
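To make Equation 9 concrete, the unnormalized log-posterior of a candidate reward table can be evaluated as sketched below. Several choices here are assumptions rather than details from the text: $Q^{*}(\tau ,R)$ is read as the sum of optimal Q-values over the demonstrated state-action pairs, the prior is an independent zero-mean Gaussian on each $R(s,a)$, and the names `q_star` and `log_posterior` are hypothetical. An MCMC sampler such as the one used by BIRL would call a function like this to score proposals.

```python
import numpy as np


def q_star(P, R, gamma, iters=500):
    """Value iteration for Q*, with P[s, a, s2] and state-action rewards R[s, a]."""
    q = np.zeros_like(R, dtype=float)
    for _ in range(iters):
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * E[max_a' Q(s',a')].
        q = R + gamma * (P @ q.max(axis=1))
    return q


def log_posterior(R, demos, P, gamma, alpha, prior_std=1.0):
    """Unnormalized log posterior of R given demonstrated (state, action) pairs."""
    q = q_star(P, R, gamma)
    log_lik = alpha * sum(q[s, a] for s, a in demos)
    # Assumed prior: independent zero-mean Gaussian on every R(s, a).
    log_prior = -0.5 * np.sum((R / prior_std) ** 2)
    return log_lik + log_prior


# One state with two self-looping actions; the expert always picks action 0.
P = np.ones((1, 2, 1))
demos = [(0, 0)]
good = log_posterior(np.array([[1.0, 0.0]]), demos, P, gamma=0.5, alpha=1.0)
bad = log_posterior(np.array([[0.0, 1.0]]), demos, P, gamma=0.5, alpha=1.0)
```

The reward table that favors the demonstrated action receives the higher score, as expected from the likelihood term.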
2.3 Variational Inference
For observations $x={x}_{1:n}$ and latent variables $z={z}_{1:m}$, the joint density can be written as:
$$p(z,x)=p(z)\,p(x\mid z)$$  (10) 
The latent variables are drawn from a prior distribution $p(z)$ and are then related to the observations through the likelihood $p(x\mid z)$. Inference in a Bayesian framework amounts to conditioning on data and computing the posterior $p(z\mid x)$. In many cases, this posterior is intractable and requires approximate inference. Variational inference has been proposed in recent years as an alternative to MCMC sampling that uses optimization instead of sampling [5].
For a family of approximate densities $\zeta $ over the latent variables, we try to find the member of the family that minimizes the Kullback-Leibler (KL) divergence to the exact posterior
$${q}^{*}(z)=\underset{q(z)\in \zeta}{\arg\min}\;KL\left(q(z)\,\|\,p(z\mid x)\right)$$  (11) 
The posterior is then approximated with the optimized member of the family ${q}^{*}(z)$. The KL divergence is then given by
$$KL\left(q(z)\,\|\,p(z\mid x)\right)=\mathbb{E}[\log q(z)]-\mathbb{E}[\log p(z\mid x)]$$  (12)
$$=\mathbb{E}[\log q(z)]-\mathbb{E}[\log p(z,x)]+\log p(x)$$  (13) 
Since the divergence cannot be computed directly, the VAE optimizes an alternative objective, the evidence lower bound (ELBO); maximizing it is equivalent to minimizing the divergence:
$${L}_{ELBO}(q)=\mathbb{E}[\log p(x)]-KL\left(q(z)\,\|\,p(z\mid x)\right)$$  (14)
$$=\mathbb{E}[\log p(x\mid z)]-KL\left(q(z\mid x)\,\|\,p(z\mid x)\right)$$  (15) 
This can be defined as a sum of two separate losses:
$${L}_{VAE}={L}_{lk}+{L}_{div}$$  (16) 
where ${L}_{lk}$ is the loss related to the log-likelihood and ${L}_{div}$ is the loss related to the divergence.
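In common VAE implementations the divergence term is evaluated in closed form against a standard normal prior, and latent samples are drawn with the reparameterization trick. The sketch below shows both pieces under that assumption (diagonal Gaussian encoder output, standard normal prior); the function names are hypothetical.

```python
import numpy as np


def reparameterize(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mean**2 - 1.0 - log_var, axis=-1)


rng = np.random.default_rng(0)
mean = np.array([[1.0, 0.0]])
log_var = np.zeros((1, 2))
z = reparameterize(mean, log_var, rng)
print(kl_to_standard_normal(mean, log_var))  # [0.5]
```

The divergence is zero only when the encoder output matches the prior exactly, which is what pulls the latent code toward the prior belief.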
The CVAE is used to perform probabilistic inference and make diverse predictions for structured outputs. The loss function is slightly altered by the introduction of class labels $c$:
$${L}_{CVAE}=\mathbb{E}[\log p(x\mid c)]-KL\left(q(z\mid x,c)\,\|\,p(z\mid x,c)\right)$$  (17)
$$=\mathbb{E}[\log p(x\mid z,c)]-KL\left(q(z\mid x,c)\,\|\,p(z\mid x,c)\right)$$  (18) 
2.4 Wasserstein loss function
The Wasserstein distance, also known as the Kantorovich-Rubinstein distance or earth mover’s distance (EMD) [23], provides a natural distance between probability distributions on a metric space [9]. It is a formulation of the optimal transport problem [29] in which the Wasserstein distance is the minimum cost required to move a pile of earth (an arbitrary distribution) to another. The mathematical formulation given by Kantorovich [28] is:
$${W}_{c}({P}_{X},{P}_{Y})=\underset{\mathrm{\Gamma}\in P(X\sim {P}_{X},Y\sim {P}_{Y})}{inf}{\mathbb{E}}_{(X,Y)\sim \mathrm{\Gamma}}[c(X,Y)]$$  (19) 
where $c(X,Y)$ is the cost function, $X$ and $Y$ are random variables with marginal distributions ${P}_{X}$ and ${P}_{Y}$ respectively.
EMD has been used in various practical applications in computer science such as pattern recognition in images [11]. The Wasserstein GAN (WGAN) was proposed by [3] to minimize the EMD between the generative distribution and the data distribution. [28] proposed the Wasserstein Autoencoder (WAE), in which the divergence loss is calculated using the EMD instead of the KL-divergence, and showed it to be robust in the presence of noise and smaller samples.
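For intuition about Equation 19, the one-dimensional case with cost $c(X,Y)=|X-Y|$ and two empirical samples of equal size has a simple solution: the optimal coupling matches the sorted samples. The sketch below relies on that special case only; it is not the estimator used later in the paper, and the name `wasserstein_1d` is hypothetical.

```python
import numpy as np


def wasserstein_1d(x, y):
    """Earth mover's distance between two equal-size 1-D samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes samples of equal size")
    # Sorting pairs the i-th smallest points, which is the optimal transport plan in 1-D.
    return np.mean(np.abs(x - y))


print(wasserstein_1d([2.0, 0.0, 1.0], [1.0, 3.0, 2.0]))  # 1.0
```

Shifting every point by one unit costs exactly one unit of transport per point, which is what the example reports.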
3 Conditional Wasserstein AutoEncoder-IRL
In this paper, my primary argument is that the inverse reinforcement learning problem can be cast as a supervised learning problem with a learned latent variable. The reward function, $r({s}_{t},{a}_{t},{s}_{t+1})$, can be formulated as a latent function that depends on the state at time $t$, ${s}_{t}$, the action at time $t$, ${a}_{t}$, and the state at time $(t+1)$, ${s}_{t+1}$. In the CVAE framework, using the state-action pair as the class label $c$ and rewriting the CVAE loss in Equation 17 with ${s}_{t+1}$ as $x$ and the reward at time $t$, ${r}_{t}$, as $z$, we get:
$${L}_{CVAE\text{-}IRL}=\mathbb{E}[\log p({s}_{t+1}\mid {s}_{t},{a}_{t})]-KL\left(q({r}_{t}\mid {s}_{t+1},{s}_{t},{a}_{t})\,\|\,p({r}_{t}\mid {s}_{t+1},{s}_{t},{a}_{t})\right)$$  (20) 
The first part of Equation 20 is the log-likelihood of the transition probability of an MDP and the second part is the KL-divergence of the encoded reward function from the prior Gaussian belief. Thus, the proposed method tries to recover the next state from the current state and current action by encoding the reward function as the latent variable and constraining the reward function to lie as close as possible to the Gaussian prior. The network structure of the method is given in Figure 1.
3.1 Encoder
The encoder is a neural network that takes the current state, current action and next state as input and generates a probability distribution $q({r}_{t}\mid {s}_{t+1},{s}_{t},{a}_{t})$, assumed to be an isotropic Gaussian. Since a near-optimal policy is the input to the IRL framework, minibatches of randomly sampled $({s}_{t+1},{s}_{t},{a}_{t})$ are fed into the network. Two hidden layers with dropout encode the input data into a latent space, giving two outputs corresponding to the mean and log-variance of the distribution. This step is similar to the backward inference problem in IRL methodology, where the reward function is constructed from the sampled trajectories of a near-optimal agent.
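Preparing the encoder input can be sketched as follows. The trajectory layout (a list of `(state, action)` pairs per trajectory) and both function names are assumptions made for illustration; the text only specifies that minibatches of randomly sampled transitions are used.

```python
import numpy as np


def transitions_from_trajectories(trajectories):
    """Flatten trajectories of (state, action) pairs into (s_t, a_t, s_{t+1}) arrays."""
    s, a, s_next = [], [], []
    for traj in trajectories:
        # Pair every step with its successor; the last step has no next state.
        for (s_t, a_t), (s_tp1, _) in zip(traj[:-1], traj[1:]):
            s.append(s_t)
            a.append(a_t)
            s_next.append(s_tp1)
    return (np.asarray(s, dtype=float),
            np.asarray(a, dtype=float),
            np.asarray(s_next, dtype=float))


def sample_minibatch(s, a, s_next, batch_size, rng):
    """Draw a random minibatch of transitions without replacement."""
    idx = rng.choice(len(s), size=batch_size, replace=False)
    return s[idx], a[idx], s_next[idx]


traj = [([0.0, 0.0], [1.0]), ([1.0, 0.0], [0.0]), ([1.0, 1.0], [1.0])]
s, a, s_next = transitions_from_trajectories([traj])
batch = sample_minibatch(s, a, s_next, batch_size=2, rng=np.random.default_rng(0))
```

Because only stored transitions are sampled, no environment model or agent rollout is needed at this stage.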
3.2 Decoder
Given the current state and action, the decoder maps the latent space into the state space and reconstructs the next state ${\widehat{s}}_{t+1}$ from a sampled ${r}_{t}$ (from the normal distribution). As in the VAE formulation, samples are generated from a standard normal distribution and reparameterized using the mean and log-variance computed in the encoder step. This step resembles the forward inference problem of an MDP where, given a state, action and reward distribution, we estimate the next state that the agent reaches. Two hidden layers with dropout are used, as in the encoder.
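A forward pass through the encoder and decoder described in Sections 3.1 and 3.2 might look like the following NumPy sketch. The layer width, ReLU activations, weight initialization and the scalar latent reward are all illustrative assumptions, and dropout is omitted for brevity; this is not the author's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(sizes):
    """Random weights for a fully connected network (sizes are illustrative)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    """ReLU on the hidden layers, linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x


state_dim, action_dim, hidden = 3, 1, 16
# Encoder: (s_{t+1}, s_t, a_t) -> mean and log-variance of a scalar r_t.
encoder = init_mlp([2 * state_dim + action_dim, hidden, hidden, 2])
# Decoder: (s_t, a_t, r_t) -> reconstructed s_{t+1}.
decoder = init_mlp([state_dim + action_dim + 1, hidden, hidden, state_dim])


def forward(s, a, s_next):
    stats = mlp(encoder, np.concatenate([s_next, s, a], axis=1))
    mean, log_var = stats[:, :1], stats[:, 1:]
    # Reparameterization: sample r_t from N(mean, exp(log_var)).
    r = mean + np.exp(0.5 * log_var) * rng.standard_normal(mean.shape)
    s_next_hat = mlp(decoder, np.concatenate([s, a, r], axis=1))
    return r, s_next_hat


r, s_next_hat = forward(np.zeros((5, 3)), np.zeros((5, 1)), np.ones((5, 3)))
```

The encoder plays the role of backward inference (reward from transition) and the decoder the role of forward inference (next state from state, action and reward).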
3.3 Using Wasserstein loss
Even though the KL-divergence should, in theory, be able to serve as the loss, it does not converge in practice and gives very large values for small samples such as those in our formulation. [28] provides a Maximum Mean Discrepancy (MMD) measure based on the Wasserstein metric for a positive-definite reproducing kernel $k(\cdot ,\cdot )$ such as the Radial Basis Function (RBF) kernel:
$$MMD_{k}({P}_{z},{Q}_{z})={\left\| \int_{z}k(z,\cdot )\,d{P}_{z}(z)-\int_{z}k(z,\cdot )\,d{Q}_{z}(z)\right\|}_{{H}_{k}}$$  (21) 
where ${H}_{k}$ is the Reproducing Kernel Hilbert Space (RKHS) of real-valued functions mapping $\mathbb{Z}\to \mathbb{R}$. The divergence loss can then be written as:
$$W\left(q(z\mid x)\,\|\,p(z\mid x)\right)=\frac{\lambda}{n(n-1)}\sum_{l\ne j}k({z}_{l},{z}_{j})+\frac{\lambda}{n(n-1)}\sum_{l\ne j}k(\tilde{z}_{l},\tilde{z}_{j})-\frac{2\lambda}{{n}^{2}}\sum_{l,j}k({z}_{l},\tilde{z}_{j})$$  (22) 
where $c$ is the cost between the input $x$ and the output $\tilde{z}$ of the decoder $D$ using the sampled latent variable $z$ (given as the mean squared error). The resulting CWAE-IRL loss function is given as:
$${L}_{CWAE\text{-}IRL}=\mathbb{E}[\log p({s}_{t+1}\mid {s}_{t},{a}_{t})]-W\left(q({r}_{t}\mid {s}_{t+1},{s}_{t},{a}_{t})\,\|\,p({r}_{t}\mid {s}_{t+1},{s}_{t},{a}_{t})\right)$$  (23) 
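The sample-based divergence in Equation 22 can be computed directly from two batches of latent samples. The sketch below assumes an RBF kernel with a fixed bandwidth `sigma` and treats `z` and `z_tilde` as the two sample sets appearing in the equation; the bandwidth, the weight `lam` and the function names are hypothetical choices.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Pairwise RBF kernel matrix between the rows of a and the rows of b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))


def mmd_penalty(z, z_tilde, lam=1.0, sigma=1.0):
    """Three-term MMD estimate: within z, within z_tilde, and the cross term."""
    n = z.shape[0]
    off_diag = 1.0 - np.eye(n)  # mask that drops the l == j terms
    k_zz = rbf_kernel(z, z, sigma)
    k_tt = rbf_kernel(z_tilde, z_tilde, sigma)
    k_zt = rbf_kernel(z, z_tilde, sigma)
    return (lam / (n * (n - 1)) * np.sum(k_zz * off_diag)
            + lam / (n * (n - 1)) * np.sum(k_tt * off_diag)
            - 2.0 * lam / n**2 * np.sum(k_zt))


same = mmd_penalty(np.zeros((4, 1)), np.zeros((4, 1)))
far = mmd_penalty(np.zeros((4, 1)), np.full((4, 1), 100.0))
```

With identical point masses the three terms cancel, while well-separated batches leave only the two within-batch terms, so the penalty grows as the encoded samples drift away from the reference samples.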
4 Experiments
In this section, I present the results of CWAE-IRL on two simulated tasks, objectworld and pendulum.
4.1 Objectworld
Objectworld is a generalization of gridworld [25], described in [16]. It contains an N×N grid of states with five actions per state, corresponding to a step in each direction and staying in place. Each action has a 30% chance of moving in a different random direction. Objects are placed at random, each with an inner and an outer color chosen from two colors, red and green. Each grid cell has 4 continuous features, each giving the Euclidean distance to the nearest object with a specific inner or outer color. The true reward is +1 in states within 3 cells of outer red and 2 cells of outer green, -1 in states within 3 cells of outer red only, and zero otherwise. Inner colors serve as distractors. The expert trajectories supplied have a length of 16. The implementations of objectworld, Maximum Entropy IRL and Deep Maximum Entropy IRL are taken from the GitHub repository of [2] without any modifications. Only continuous features are used for all implementations.
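The objectworld reward rule can be written as a short function of the distance features. In the sketch below, "within k cells" is interpreted as a Euclidean distance of at most k, which is an assumption, as are the function names and the representation of objects as lists of grid coordinates.

```python
import numpy as np


def nearest_distance(cell, objects):
    """Euclidean distance from a grid cell to the closest object position."""
    return min(np.hypot(cell[0] - x, cell[1] - y) for x, y in objects)


def objectworld_reward(cell, outer_red, outer_green):
    """+1 near both outer colors, -1 near outer red only, 0 otherwise."""
    near_red = nearest_distance(cell, outer_red) <= 3
    near_green = nearest_distance(cell, outer_green) <= 2
    if near_red and near_green:
        return 1.0
    if near_red:
        return -1.0
    return 0.0


outer_red = [(0, 0)]
outer_green = [(9, 9)]
print(objectworld_reward((1, 1), outer_red, outer_green))  # -1.0
```

The two calls to `nearest_distance` correspond to two of the four continuous features; the inner-color distances are computed the same way but do not affect the reward.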
CWAE-IRL is compared with the prior IRL methods described in the previous sections: BIRL [21], Maximum Entropy IRL [33] and Deep Maximum Entropy IRL [31]. Only Maximum Entropy IRL models the reward as a linear combination of features, while the others describe it in a nonlinear fashion. The learnt rewards for all the algorithms are shown in Figure 2 for an objectworld of grid size 10. CWAE-IRL can recover the original reward distribution, while Deep Maximum Entropy IRL overestimates the reward in various places. Maximum Entropy IRL and BIRL completely fail to learn the rewards. Deep Maximum Entropy IRL tends to give negative rewards to regions of the state space that are not well traversed in the example trajectories. CWAE-IRL, however, generalizes over these regions even though they have not been visited frequently. Also, due to the constraint of staying close to the prior Gaussian belief, the scale of the rewards is best captured by the proposed algorithm, whereas the other algorithms tend to overscale the rewards.
4.2 Pendulum
The pendulum environment [6] is a well-known problem in the control literature in which a pendulum starts from a random position and the goal is to keep it upright while applying the minimum amount of force. The state vector is composed of the cosine and sine of the angle of the pendulum and the derivative of the angle. The action is the joint effort, discretized into $11$ actions linearly spaced within the $[-2,2]$ range. The reward is
$$R=-({w}_{1}\cdot {\parallel \theta \parallel}^{2}+{w}_{2}\cdot {\parallel \dot{\theta}\parallel}^{2}+{w}_{3}\cdot {\parallel a\parallel}^{2}),$$  (24) 
where ${w}_{1}$, ${w}_{2}$ and ${w}_{3}$ are the reward weights for the angle $\theta $, derivative of angle $\dot{\theta}$ and action $a$ respectively. The optimal reward weights given by OpenAI are $[1,0.1,0.001]$ respectively. An episode is limited to 1000 timesteps.
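The action grid and reward for this task can be sketched as below, reading the reward as a negated weighted cost as in the OpenAI Gym pendulum task. The sketch assumes the angle has already been normalized so that upright corresponds to $\theta =0$; the constant and function names are hypothetical.

```python
import numpy as np

# 11 discrete torques linearly spaced over the allowed range.
ACTIONS = np.linspace(-2.0, 2.0, 11)
# Reward weights for angle, angular velocity and action.
WEIGHTS = (1.0, 0.1, 0.001)


def pendulum_reward(theta, theta_dot, action, weights=WEIGHTS):
    """Negative weighted quadratic cost; the maximum of 0 is reached upright and at rest."""
    w1, w2, w3 = weights
    return -(w1 * theta**2 + w2 * theta_dot**2 + w3 * action**2)


print(ACTIONS[:3])  # the grid starts at -2.0 and steps by 0.4
print(pendulum_reward(0.0, 0.0, 0.0))
```

Because the reward is never positive, a recovered reward can be compared against this function directly on the visited state-action pairs.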
A deep Q-network (DQN), proposed by [17], combines deep neural networks with RL to solve problems with continuous states and discrete actions. DQN uses a neural network that gives the Q-values for every action and a buffer that stores old states and actions to sample from, which helps stabilize training. With a continuous state space, it is impossible to visit all states during training, which also makes it very difficult to compare the recovered reward with the actual reward. The DQN is trained for 50,000 episodes. CWAE-IRL is trained using 25 trajectories and the reward is predicted for 5 trajectories. The error plot between the recovered reward and the actual reward is given in Figure 3. The mean error hovers around 0, showing that for the majority of the states and actions the proposed method is able to recover the correct reward.
5 Conclusions
I have presented an algorithm for inverse reinforcement learning that learns the latent reward function using a conditional variational autoencoder with a Wasserstein loss function. It learns the reward function as a continuous Gaussian distribution while trying to reconstruct the next state given the current state and action. The proposed model makes the inference process scalable and simplifies the inference of reward functions. Inferring a continuous parametric distribution over reward functions can prove useful for classifying the behaviors of decision-making agents in diverse applications such as autonomous driving.
Acknowledgments
I would like to acknowledge the help of Satabdi Saha in editing the manuscript.
References
 [1] (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1.
 [2] (2016) Inverse reinforcement learning.
 [3] (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875.
 [4] (1957) A markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684.
 [5] (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877.
 [6] (2016) Openai gym. arXiv preprint arXiv:1606.01540.
 [7] (2015) Graph-based inverse optimal control for robot manipulation. In IJCAI, Vol. 15, pp. 1874–1890.
 [8] (2016) Guided cost learning: deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58.
 [9] (2015) Learning with a wasserstein loss. In Advances in Neural Information Processing Systems, pp. 2053–2061.
 [10] (2017) Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248.
 [11] (2018) Wasserstein cnn: learning invariant features for NIR-VIS face recognition. IEEE transactions on pattern analysis and machine intelligence 41 (7), pp. 1761–1773.
 [12] (1999) An introduction to variational methods for graphical models. Machine learning 37 (2), pp. 183–233.
 [13] (2014) Stochastic gradient vb and the variational autoencoder. In Second International Conference on Learning Representations, ICLR.
 [14] (2013) A cascaded supervised learning approach to inverse reinforcement learning. In Joint European conference on machine learning and knowledge discovery in databases, pp. 1–16.
 [15] (2012) Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617.
 [16] (2011) Nonlinear inverse reinforcement learning with gaussian processes. In Advances in Neural Information Processing Systems, pp. 19–27.
 [17] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
 [18] (2009) Training parsers by inverse reinforcement learning. Machine learning 77 (2–3), pp. 303.
 [19] (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287.
 [20] (2000) Algorithms for inverse reinforcement learning. In ICML, pp. 663–670.
 [21] (2007) Bayesian inverse reinforcement learning. Urbana 51 (61801), pp. 1–4.
 [22] (2006) Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736.
 [23] (2000) The earth mover’s distance as a metric for image retrieval. International journal of computer vision 40 (2), pp. 99–121.
 [24] (2015) Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491.
 [25] (1998) Reinforcement learning: an introduction. Vol. 1, MIT press Cambridge.
 [26] (2008) A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, pp. 1449–1456.
 [27] (2007) Linearly-solvable markov decision problems. In Advances in neural information processing systems, pp. 1369–1376.
 [28] (2017) Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558.
 [29] (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media.
 [30] (2008) Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1 (1–2), pp. 1–305.
 [31] (2015) Deep inverse reinforcement learning. CoRR, abs/1507.04888.
 [32] (2014) Robust bayesian inverse reinforcement learning with sparse behavior noise. In AAAI, pp. 2198–2205.
 [33] (2008) Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438.