Stochastic Inverse Reinforcement Learning

  • 2020-07-29 11:31:49
  • Ce Ju, Dong Eui Chang

Abstract

Inverse reinforcement learning (IRL) is an ill-posed inverse problem, since expert demonstrations may be consistent with many reward functions, which are hard to recover by local search methods such as gradient methods. In this paper, we generalize the original IRL problem to recover a probability distribution over reward functions. We call this generalized problem stochastic inverse reinforcement learning (SIRL) and formulate it as an expectation optimization problem. We adopt the Monte Carlo expectation-maximization (MCEM) method, a global search method, to estimate the parameter of the probability distribution as the first solution to SIRL. With our approach, it is possible to observe the intrinsic properties of IRL from a global viewpoint, and the technique achieves considerably robust recovery performance on the classic learning environment, objectworld.

 


Ce Ju
WeBank Co., Ltd.
Shenzhen, P.R. China
[email protected]
Dong Eui Chang
School of Electrical Engineering
Korea Advanced Institute of Science and Technology
Daejeon, South Korea
[email protected]

Preprint. Under review.

1 Introduction

IRL addresses the ill-posed inverse problem of recovering a reward function in a Markov decision process (MDP) from expert demonstrations [1]. The recovered reward function quantifies how good or bad certain actions are, and with knowledge of the reward function, agents can perform better. However, not every reward function provides a succinct, robust, and transferable definition of the learning task, and a policy may be optimal for many reward functions [1]. For example, any given policy is optimal for a constant reward function in an MDP. Thus, the core of IRL is to explore the regular structure of reward functions.

Many existing methods impose a regular structure on reward functions through a combination of hand-selected features. Formally, a set of real-valued feature functions $\{\phi_i(s)\}_{i=1}^{M}$ is hand-selected by experts as basis functions of the reward function space in an MDP. The reward function is then approximated by a linear or nonlinear combination of the hand-selected feature functions $\phi_i(s)$, and the goal is to find the best-fitting weights of the feature functions. One framework in this line of work formulates the IRL problem as a numerical optimization problem [1; 2; 3; 4]; the other is a probabilistic approach based on maximum a posteriori estimation [5; 6; 7; 8].

In this paper, we propose a generalized perspective on the IRL problem called stochastic inverse reinforcement learning (SIRL). SIRL is formulated as an expectation optimization problem that aims to recover a probability distribution over reward functions from expert demonstrations. The solution to classic IRL is typically not the best-fitting one, because a highly nonlinear inverse problem with limited information from a collection of expert behavior is very likely to get trapped in a secondary maximum for a partially observable system. We therefore employ the MCEM approach [9], which adopts a Monte Carlo mechanism for exhaustive search as a global search method, to give the first solution to the SIRL problem in a model-based environment. The result is a reward conditional probability distribution that can generate more than one set of weights for the reward feature functions, each composing an alternative solution to the IRL problem. The main benefit of this generalized perspective is a method for analyzing and displaying any given highly nonlinear IRL problem through a large collection of pseudorandomly generated local likelihood maxima. In view of the successful application of IRL to imitation learning and apprenticeship learning in industry [4; 10; 11; 12], our generalized method has considerable potential practical value.

2 Preliminary

An MDP is defined as a tuple $\mathcal{M} := (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, and the transition function $\mathcal{T} := \mathbb{P}(s_{t+1} = s' \mid s_t = s, a_t = a)$ for $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$ records the probability of being in the current state $s$, taking action $a$, and arriving at the next state $s'$. The reward $R(s)$ is a real-valued function and $\gamma \in [0, 1)$ is the discount factor. A policy $\pi$, which is a map from states to actions, has two formulations: a stochastic policy is a conditional distribution $\pi(a \mid s)$, and a deterministic policy is a deterministic function $a = \pi(s)$. Sequential decisions are recorded in a series of episodes consisting of states, actions, and rewards. The goal of a reinforcement learning task is to find the optimal policy $\pi^*$ that maximizes the expected total reward $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi]$. In an IRL problem setting, we have an MDP without a reward function, denoted MDP\R, and a collection of expert demonstrations $\zeta^E := \{\zeta_1, \dots, \zeta_m\}$. Each demonstration $\zeta_i$ consists of sequential state-action pairs representing the behavior of an expert. The goal of the IRL problem is to estimate the unknown reward function $R(s)$ from expert demonstrations for an MDP\R. The learned complete MDP yields an optimal policy that acts as closely as possible to the expert demonstrations.
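To make the forward problem concrete, the following is a minimal value-iteration sketch for a finite MDP with a state-only reward. It is an illustration under our own assumptions, not the implementation used in the experiments: the function name `value_iteration`, the list-based `transition` layout, and the two-state toy chain are all hypothetical.

```python
def value_iteration(n_states, n_actions, transition, reward, gamma, tol=1e-8):
    """Return (values, greedy_policy) for a finite MDP.

    transition[s][a] is a list of (next_state, probability) pairs,
    reward[s] is the state reward R(s), and gamma is the discount factor.
    """
    values = [0.0] * n_states
    while True:
        new_values = []
        for s in range(n_states):
            # Bellman optimality backup with a state-only reward.
            best = max(
                sum(p * values[s2] for s2, p in transition[s][a])
                for a in range(n_actions)
            )
            new_values.append(reward[s] + gamma * best)
        delta = max(abs(v - w) for v, w in zip(values, new_values))
        values = new_values
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = [
        max(range(n_actions),
            key=lambda a: sum(p * values[s2] for s2, p in transition[s][a]))
        for s in range(n_states)
    ]
    return values, policy


# Hypothetical 2-state chain: action 0 stays, action 1 moves to the other state.
transition = [
    [[(0, 1.0)], [(1, 1.0)]],
    [[(1, 1.0)], [(0, 1.0)]],
]
values, policy = value_iteration(2, 2, transition, reward=[0.0, 1.0], gamma=0.9)
print(policy)
```

In this toy chain only state 1 is rewarded, so the greedy policy moves from state 0 to state 1 and then stays there. IRL runs this computation in reverse: the policy (through demonstrations) is observed, and the reward is unknown.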

2.1 MaxEnt and DeepMaxEnt

In this section, we give a brief overview of the probabilistic approach to the IRL problem under two existing kinds of regular structure on reward functions. The first is a linear structure, in which the reward function is written as $R(s) = \sum_{i=1}^{M} \alpha_i \phi_i(s)$, where $\phi_i : \mathcal{S} \to \mathbb{R}^d$ are $d$-dimensional feature functions hand-selected by experts [1]. Ziebart et al. [6] propose a probabilistic approach that deals with the ambiguity of the IRL problem based on the principle of maximum entropy, called maximum entropy IRL (MaxEnt). In MaxEnt, trajectories with higher rewards are assumed to be exponentially more preferred, $\mathbb{P}(\zeta \mid R) \propto \exp\left(\sum_{s \in \zeta} R(s)\right)$, where $\zeta$ is one trajectory from the expert demonstrations. The objective function of MaxEnt is derived from maximizing the likelihood of the expert trajectories under the maximum entropy principle, and it is always convex for deterministic MDPs. Typically, the optimum is obtained by a gradient-based method. The second is a nonlinear structure of the form $R(s) = F(\phi_1(s), \dots, \phi_M(s))$, where $F$ is a nonlinear function of the feature basis $\{\phi_i(s)\}_{i=1}^{M}$. Following the principle of maximum entropy, Wulfmeier et al. [7] extend MaxEnt with a neural network that approximates the unknown nonlinear reward, which is called maximum entropy deep IRL (DeepMaxEnt).
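The exponential preference over trajectories can be illustrated with a small sketch. It assumes deterministic dynamics and normalizes over a finite, hand-picked candidate set of trajectories, which is a simplification of the MaxEnt partition function; the names `linear_reward` and `maxent_trajectory_probs` and the two-feature encoding `phi` are hypothetical.

```python
import math


def linear_reward(weights, features):
    """R(s) = sum_i alpha_i * phi_i(s) for one state's feature vector."""
    return sum(a * f for a, f in zip(weights, features))


def maxent_trajectory_probs(trajectories, phi, weights):
    """Normalize exp(sum of state rewards) over a finite candidate set."""
    returns = [
        sum(linear_reward(weights, phi[s]) for s in traj)
        for traj in trajectories
    ]
    shift = max(returns)  # subtract the max for numerical stability
    unnorm = [math.exp(r - shift) for r in returns]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical encoding: state 0 carries feature 1, state 1 carries feature 2.
phi = {0: [1.0, 0.0], 1: [0.0, 1.0]}
trajectories = [[0, 0, 1], [0, 1, 1], [1, 1, 1]]
probs = maxent_trajectory_probs(trajectories, phi, weights=[0.0, 1.0])
print(probs)
```

With weights that reward only the second feature, each additional visit to state 1 multiplies the trajectory probability by a factor of $e$.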

To generalize the classic regular structures on the reward function, we propose a stochastic regular structure on the reward function in the following section.

2.2 Problem Statement

Formally, we are given an MDP\R $= (\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma)$ with a known transition function $\mathcal{T} = \mathbb{P}(s_{t+1} = s' \mid s_t = s, a_t = a)$ for $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, and a hand-crafted reward feature basis $\{\phi_i(s)\}_{i=1}^{M}$. A stochastic regular structure on the reward function assumes that the weights $\mathcal{W}$ of the reward feature functions $\phi_i(s)$ are random variables with a reward conditional probability distribution $\mathcal{D}(\mathcal{W} \mid \zeta^E)$ conditional on the expert demonstrations $\zeta^E$. Parametrizing $\mathcal{D}(\mathcal{W} \mid \zeta^E)$ with parameter $\Theta$, our aim is to estimate the best-fitting parameter $\Theta^*$ from the expert demonstrations $\zeta^E$, such that $\mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta^*)$ is more likely to generate weights that compose reward functions like the ones derived from the expert demonstrations. We call this the stochastic inverse reinforcement learning problem.

In practice, expert demonstrations $\zeta^E$ can be observed but may lack sampling representativeness [13]. For example, one driver's demonstrations encode that driver's own preferences in driving style but may not reflect the true rewards of the environment. To overcome this limitation, we introduce a representative trajectory class $\mathcal{C}_\epsilon^E$ such that each trajectory element set $\mathcal{O} \in \mathcal{C}_\epsilon^E$ is a subset of the expert demonstrations $\zeta^E$ with cardinality at least $\epsilon \cdot |\zeta^E|$, where $\epsilon$ is a preset threshold and $|\zeta^E|$ is the number of expert demonstrations. It is written as $\mathcal{C}_\epsilon^E := \{\mathcal{O} \mid \mathcal{O} \subset \zeta^E \text{ with } |\mathcal{O}| \geq \epsilon \cdot |\zeta^E|\}$.
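One way to draw a trajectory element set from the representative trajectory class is sketched below. The helper `sample_trajectory_element_set` is hypothetical, and it makes a simplifying assumption: it first picks the subset size uniformly between the minimum size and the full size, then picks a subset of that size uniformly. This is not exactly uniform over all admissible subsets, which would weight each size by the number of subsets of that size.

```python
import math
import random


def sample_trajectory_element_set(demonstrations, epsilon, rng):
    """Draw one subset O of the demonstrations with |O| >= epsilon * |demonstrations|."""
    n = len(demonstrations)
    min_size = math.ceil(epsilon * n)
    # Assumption: the size is uniform on [min_size, n], inclusive on both ends.
    size = rng.randint(min_size, n)
    # Sample `size` distinct demonstrations without replacement.
    return rng.sample(demonstrations, size)


rng = random.Random(0)
demos = list(range(20))  # stand-ins for 20 expert trajectories
subset = sample_trajectory_element_set(demos, 0.5, rng)
print(len(subset))
```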

We integrate out the unobserved weights $\mathcal{W}$, and the SIRL problem is then formulated as estimating the parameter $\Theta$ in an expectation optimization problem over the representative trajectory class:

$$\Theta^* := \arg\max_{\Theta} \mathbb{E}_{\mathcal{O} \in \mathcal{C}_\epsilon^E}\left[\int_{\mathcal{W}} f(\mathcal{O}, \mathcal{W} \mid \Theta)\, d\mathcal{W}\right], \tag{1}$$

where the trajectory element set $\mathcal{O}$ is assumed to be uniformly distributed for simplicity in this study (in practice its distribution is usually known from a rough estimate of the statistics of the expert demonstrations), and $f$ is the conditional joint probability density function of the trajectory element set $\mathcal{O}$ and the weights $\mathcal{W}$ of the reward feature functions, conditional on the parameter $\Theta$.

In the following section, we propose a novel approach to estimate the best-fitting parameter Θ* in Equation 1, which is called the two-stage hierarchical method, a variant of MCEM method.

3 Methodology

3.1 Two-stage Hierarchical Method

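The two-stage hierarchical method requires us to write the parameter $\Theta$ in a profile form $\Theta := (\Theta_1, \Theta_2)$. The conditional joint density $f(\mathcal{O}, \mathcal{W} \mid \Theta)$ in Equation 1 can be written as the product of two conditional densities $g$ and $h$:

$$f(\mathcal{O}, \mathcal{W} \mid \Theta_1, \Theta_2) = g(\mathcal{O} \mid \mathcal{W}, \Theta_1)\, h(\mathcal{W} \mid \Theta_2). \tag{2}$$

Taking the logarithm of both sides of Equation 2 gives

$$\log f(\mathcal{O}, \mathcal{W} \mid \Theta_1, \Theta_2) = \log g(\mathcal{O} \mid \mathcal{W}, \Theta_1) + \log h(\mathcal{W} \mid \Theta_2). \tag{3}$$

We optimize the right-hand side of Equation 3 over the profile parameter $\Theta$ in the expectation-maximization (EM) update steps at the $t$-th iteration, independently for each component:

$$\Theta_1^{t+1} := \arg\max_{\Theta_1} \mathbb{E}\left(\log g(\mathcal{O} \mid \mathcal{W}, \Theta_1) \mid \mathcal{C}_\epsilon^E, \Theta^t\right); \tag{4}$$
$$\Theta_2^{t+1} := \arg\max_{\Theta_2} \mathbb{E}\left(\log h(\mathcal{W} \mid \Theta_2) \mid \Theta^t\right). \tag{5}$$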

3.1.1 Initialization

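We randomly initialize the profile parameter $\Theta^0 := (\Theta_1^0, \Theta_2^0)$ and sample a collection of $N_0$ reward weights $\{\mathcal{W}_1^{\Theta^0}, \dots, \mathcal{W}_{N_0}^{\Theta^0}\} \sim \mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta_2^0)$. The reward weights $\mathcal{W}_i^{\Theta^0}$ compose the reward $R_{\mathcal{W}_i^{\Theta^0}}$ in each learning task $\mathcal{M}_i^0 := (\mathcal{S}, \mathcal{A}, \mathcal{T}, R_{\mathcal{W}_i^{\Theta^0}}, \gamma)$ for $i = 1, \dots, N_0$.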

3.1.2 First Stage

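In the first stage, we aim to update the parameter $\Theta_1$ for the intractable expectation in Equation 4 in each iteration. Specifically, we use a Monte Carlo method to estimate the model parameter $\Theta_1^{t+1}$ through an empirical expectation at the $t$-th iteration,

$$\hat{\mathbb{E}}\left[\log g(\mathcal{O} \mid \mathcal{W}, \Theta_1^{t+1}) \mid \mathcal{C}_\epsilon^E, \Theta^t\right] := \frac{1}{N_t} \sum_{i=1}^{N_t} \log g_{\mathcal{M}_i^t}(\mathcal{O}_i^t \mid \mathcal{W}_i^{\Theta^t}, \Theta_1^{t+1}), \tag{6}$$

where the reward weights at the $t$-th iteration, $\mathcal{W}_i^{\Theta^t}$, are randomly drawn from the reward conditional probability distribution $\mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta^t)$ and compose a set of learning tasks $\mathcal{M}_i^t := (\mathcal{S}, \mathcal{A}, \mathcal{T}, R_{\mathcal{W}_i^{\Theta^t}}, \gamma)$, each with a trajectory element set $\mathcal{O}_i^t$ uniformly drawn from the representative trajectory class $\mathcal{C}_\epsilon^E$, for $i = 1, \dots, N_t$.

The parameter $\Theta_1^{t+1}$ in Equation 6 has $N_t$ coordinates, written as $\Theta_1^{t+1} := ((\Theta_1^{t+1})_1, \dots, (\Theta_1^{t+1})_{N_t})$. For each learning task $\mathcal{M}_i^t$, the $i$-th coordinate $(\Theta_1^{t+1})_i$ is derived by maximizing a posteriori, the same technique as in MaxEnt and DeepMaxEnt [6; 7]:

$$(\Theta_1^{t+1})_i := \arg\max_{\theta} \log g_{\mathcal{M}_i^t}(\mathcal{O}_i^t \mid \mathcal{W}_i^{\Theta^t}, \theta),$$

which is a convex formulation optimized by gradient ascent.

In practice, we move $m$ steps uphill toward the optimum in each learning task $\mathcal{M}_i^t$. The update formula for the $m$-step reward weights $\mathcal{W}_{m_i}^{\Theta^t}$ is

$$\mathcal{W}_{m_i}^{\Theta^t} := \mathcal{W}_i^{\Theta^t} + \sum_{i=1}^{m} \lambda_i^t \nabla_{(\Theta_1)_i} \log g_{\mathcal{M}_i^t}(\mathcal{O}_i^t \mid \mathcal{W}_i^{\Theta^t}, (\Theta_1)_i),$$

where the learning rate $\lambda_i^t$ at the $t$-th iteration is preset. Hence, in practice the parameter $\Theta_1^{t+1}$ is represented as $\Theta_1^{t+1} := (\mathcal{W}_{m_1}^{\Theta^t}, \dots, \mathcal{W}_{m_{N_t}}^{\Theta^t})$.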
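The $m$-step uphill move can be sketched as a plain gradient-ascent loop. This is only an illustration: the gradient is recomputed at the current iterate in every step (our assumption), and the MaxEnt log-likelihood is replaced by a hypothetical concave surrogate so that the snippet stays self-contained. The names `m_step_ascent`, `target`, and `grad` are not from the paper.

```python
def m_step_ascent(weights, grad_log_likelihood, learning_rate, m):
    """Move m gradient-ascent steps uphill from the sampled weights."""
    w = list(weights)
    for _ in range(m):
        grad = grad_log_likelihood(w)  # gradient at the current iterate
        w = [wi + learning_rate * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical concave surrogate: log g(w) = -0.5 * ||w - target||^2,
# whose gradient is (target - w).
target = [1.0, -2.0]
grad = lambda w: [t - wi for t, wi in zip(target, w)]

# 20 steps at learning rate 0.01, matching the first-stage settings in Section 4.2.
w_m = m_step_ascent([0.0, 0.0], grad, learning_rate=0.01, m=20)
print(w_m)
```

Because only $m$ steps are taken, the weights move toward the local optimum without reaching it, so different Monte Carlo draws can end near different local maxima.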

3.1.3 Second Stage

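In the second stage, we aim to update the parameter $\Theta_2$ for the intractable expectation in Equation 5 in each iteration. Specifically, we consider the empirical expectation at the $t$-th iteration,

$$\hat{\mathbb{E}}\left(\log h(\mathcal{W} \mid \Theta_2^{t+1}) \mid \Theta^t\right) := \frac{1}{N_t} \sum_{i=1}^{N_t} \log h_{\mathcal{M}_i^t}(\mathcal{W}_{m_i}^{\Theta^t} \mid \Theta_2^{t+1}), \tag{7}$$

where $h$ is implicit, but fitting the set of $m$-step reward weights $\{\mathcal{W}_{m_i}^{\Theta^t}\}_{i=1}^{N_t}$ with a generative model yields a large empirical expectation value. In practice, the reward conditional probability distribution $\mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta_2^{t+1})$ is the generative model, formulated as a Gaussian mixture model (GMM), i.e.,

$$\mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta_2^{t+1}) := \sum_{k=1}^{K} \alpha_k \mathcal{N}(\mathcal{W} \mid \mu_k, \Sigma_k)$$

with $\alpha_k \geq 0$, $\sum_{k=1}^{K} \alpha_k = 1$, and parameter set $\Theta_2^{t+1} := \{\alpha_k; \mu_k, \Sigma_k\}_{k=1}^{K}$.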

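We estimate the parameter $\Theta_2^{t+1}$ of the GMM via an EM approach, initializing the GMM with the $t$-th iteration parameter $\Theta_2^t$. The EM procedure is as follows: for $i = 1, \dots, N_t$,

  • Expectation Step: Compute the responsibility $\gamma_{ij}$ for the $m$-step reward weight $\mathcal{W}_{m_i}^{\Theta^t}$,

    $$\gamma_{ij} := \frac{\alpha_j \mathcal{N}(\mathcal{W}_{m_i}^{\Theta^t} \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \alpha_k \mathcal{N}(\mathcal{W}_{m_i}^{\Theta^t} \mid \mu_k, \Sigma_k)}.$$
  • Maximization Step: Compute the weighted mean $\mu_j$, the mixing weight $\alpha_j$, and the covariance $\Sigma_j$,

    $$\mu_j := \frac{\sum_{i=1}^{N_t} \gamma_{ij} \mathcal{W}_{m_i}^{\Theta^t}}{\sum_{i=1}^{N_t} \gamma_{ij}}; \quad \alpha_j := \frac{1}{N_t} \sum_{i=1}^{N_t} \gamma_{ij};$$
    $$\Sigma_j := \frac{\sum_{i=1}^{N_t} \gamma_{ij} (\mathcal{W}_{m_i}^{\Theta^t} - \mu_j)(\mathcal{W}_{m_i}^{\Theta^t} - \mu_j)^T}{\sum_{i=1}^{N_t} \gamma_{ij}}.$$

After EM converges, we obtain the GMM parameter $\Theta_2^{t+1} := \{\alpha_k; \mu_k, \Sigma_k\}_{k=1}^{K}$ for this iteration, and the profile parameter is updated to $\Theta^{t+1} := (\Theta_1^{t+1}, \Theta_2^{t+1})$.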
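The expectation and maximization steps above can be sketched in a few lines. To keep the example free of matrix code, this sketch is restricted to one-dimensional weights, so each covariance $\Sigma_j$ reduces to a scalar variance; the function `em_gmm_1d`, the variance floor, and the toy samples are assumptions made for illustration only.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(samples, alphas, mus, variances, n_iter=100, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture, warm-started from the given parameters."""
    n, k = len(samples), len(alphas)
    alphas, mus, variances = list(alphas), list(mus), list(variances)
    for _ in range(n_iter):
        # Expectation step: responsibilities gamma[i][j].
        gamma = []
        for x in samples:
            dens = [alphas[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(dens)
            gamma.append([d / total for d in dens])
        # Maximization step: weighted mean, variance, and mixing weight.
        for j in range(k):
            nj = sum(gamma[i][j] for i in range(n))
            mus[j] = sum(gamma[i][j] * samples[i] for i in range(n)) / nj
            var = sum(gamma[i][j] * (samples[i] - mus[j]) ** 2 for i in range(n)) / nj
            variances[j] = max(var, var_floor)  # guard against collapse
            alphas[j] = nj / n
    return alphas, mus, variances


# Toy m-step weights clustered around -2 and 3 (made-up numbers).
samples = [-2.1, -1.9, -2.0, -2.2, -1.8, 2.9, 3.1, 3.0, 3.2, 2.8]
alphas, mus, variances = em_gmm_1d(samples, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(alphas, mus, variances)
```

Passing the previous iteration's parameters as the starting point mirrors the warm start with $\Theta_2^t$ described above.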

Finally, when the two-stage hierarchical method converges, our desired best-fitting parameter $\Theta^*$ in $\mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta^*)$ is the parameter $\Theta_2$ in the profile parameter $\Theta$.

3.2 Termination Criteria

In this section, we discuss the termination criteria of our algorithm. An ordinary EM algorithm usually terminates when the parameters no longer change substantively after enough iterations. For example, one classic criterion terminates the EM algorithm at the $t$-th iteration when

$$\max \frac{|\theta^t - \theta^{t-1}|}{|\theta^t| + \delta_{EM}} < \epsilon_{EM}$$

for user-specified $\delta_{EM}$ and $\epsilon_{EM}$, where $\theta$ is the model parameter in the EM algorithm.

However, such a criterion risks terminating MCEM early because of the Monte Carlo error in the update step. Hence, we adopt a practical method in which the following similar stopping criterion must hold three consecutive times,

$$\max \frac{|\Theta^t - \Theta^{t-1}|}{|\Theta^t| + \delta_{MCEM}} < \epsilon_{MCEM}$$

for user-specified $\delta_{MCEM}$ and $\epsilon_{MCEM}$ [14]. For various other stopping criteria for MCEM in the literature, refer to [15; 16].
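A minimal sketch of the three-consecutive-times rule follows, reading the maximum as a maximum over parameter coordinates. The class name `ConsecutiveStop` and the toy parameter sequence are hypothetical.

```python
def relative_change(theta_new, theta_old, delta):
    """max_i |theta_new_i - theta_old_i| / (|theta_new_i| + delta)."""
    return max(abs(a - b) / (abs(a) + delta) for a, b in zip(theta_new, theta_old))


class ConsecutiveStop:
    """Signal termination once the criterion holds `patience` times in a row."""

    def __init__(self, delta, epsilon, patience=3):
        self.delta, self.epsilon, self.patience = delta, epsilon, patience
        self.hits = 0

    def update(self, theta_new, theta_old):
        if relative_change(theta_new, theta_old, self.delta) < self.epsilon:
            self.hits += 1
        else:
            self.hits = 0  # a Monte Carlo fluctuation resets the count
        return self.hits >= self.patience


# Toy sequence of one-dimensional parameter estimates with one large jump.
thetas = [[1.0], [1.001], [1.5], [1.501], [1.502], [1.503]]
stop = ConsecutiveStop(delta=1e-3, epsilon=1e-2)
results = [stop.update(new, old) for old, new in zip(thetas, thetas[1:])]
print(results)
```

The single small change before the jump does not trigger termination; only the run of three small changes at the end does.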

3.3 Convergence Issue

The convergence of MCEM is more complicated than that of ordinary EM. Because the MDP\R is model-based and interactive, we can always increase the sample size of MCEM per iteration. In practice, we require the Monte Carlo sample size per iteration to satisfy

$$\sum_{t} \frac{1}{N_t} < \infty.$$

An additional requirement for convergence is that the parameter space be compact. For a comprehensive proof, refer to [16; 17].
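As a quick check of the sample-size condition, consider a geometric schedule that starts from $N_0 = 10$ and doubles every iteration (the schedule used in Section 4.2). Under that assumption the series is geometric and sums to a finite value:

```latex
\sum_{t=0}^{\infty} \frac{1}{N_t}
  = \sum_{t=0}^{\infty} \frac{1}{10 \cdot 2^{t}}
  = \frac{1}{10} \cdot \frac{1}{1 - \tfrac{1}{2}}
  = \frac{1}{5} < \infty .
```

By contrast, a constant schedule $N_t = N$ gives $\sum_t 1/N = \infty$ and does not satisfy the condition.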

A pseudocode of our approach is given in Algorithm 1.

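Input: Model-based environment $(\mathcal{S}, \mathcal{A}, \mathcal{T})$, expert demonstrations $\zeta^E$, Monte Carlo sample size $N_0$, and preset thresholds $\delta_{MCEM}$ and $\epsilon_{MCEM}$.
Output: Reward conditional probability distribution $\mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta^*)$.
Initialization: Randomly initialize the profile parameter $\Theta^0 := (\Theta_1^0, \Theta_2^0)$.
while the stopping criteria are not satisfied (refer to Section 3.2) do
    Draw $N_t$ reward weights $\mathcal{W}_i^{\Theta^t} \sim \mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta_2^t)$ to compose learning tasks $\mathcal{M}_i^t$, each with a uniformly drawn trajectory element set $\mathcal{O}_i^t$.
    # First Stage: Monte Carlo estimation of weights for the reward function
    for each $\mathcal{M}_i^t$ do
        Evaluate $\nabla_{(\Theta_1)_i} \log g_{\mathcal{M}_i^t}(\mathcal{O}_i^t \mid \mathcal{W}_i^{\Theta^t}, (\Theta_1)_i)$.
        Compute the updated weight parameter $\mathcal{W}_{m_i}^{\Theta^t} \leftarrow \mathcal{W}_i^{\Theta^t} + \sum_{i=1}^{m} \lambda_i^t \nabla_{(\Theta_1)_i} \log g_{\mathcal{M}_i^t}(\mathcal{O}_i^t \mid \mathcal{W}_i^{\Theta^t}, (\Theta_1)_i)$.
    end
    Update $\Theta_1^{t+1} \leftarrow \{\mathcal{W}_{m_i}^{\Theta^t}\}_{i=1}^{N_t}$.
    # Second Stage: Fit a GMM to the m-step reward weights $\{\mathcal{W}_{m_i}^{\Theta^t}\}_{i=1}^{N_t}$ with EM parameter initialization $\Theta_2^t$
    while EM has not converged do
        Expectation Step: $\gamma_{ij} \leftarrow \frac{\alpha_j \mathcal{N}(\mathcal{W}_{m_i}^{\Theta^t} \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \alpha_k \mathcal{N}(\mathcal{W}_{m_i}^{\Theta^t} \mid \mu_k, \Sigma_k)}$
        Maximization Step:
            $\mu_j \leftarrow \frac{\sum_{i=1}^{N_t} \gamma_{ij} \mathcal{W}_{m_i}^{\Theta^t}}{\sum_{i=1}^{N_t} \gamma_{ij}}$; $\alpha_j \leftarrow \frac{1}{N_t} \sum_{i=1}^{N_t} \gamma_{ij}$;
            $\Sigma_j \leftarrow \frac{\sum_{i=1}^{N_t} \gamma_{ij} (\mathcal{W}_{m_i}^{\Theta^t} - \mu_j)(\mathcal{W}_{m_i}^{\Theta^t} - \mu_j)^T}{\sum_{i=1}^{N_t} \gamma_{ij}}$;
    end
    Update $\Theta_2^{t+1}$ and the profile parameter $\Theta^{t+1} \leftarrow (\Theta_1^{t+1}, \Theta_2^{t+1})$.
end
Algorithm 1: Stochastic Inverse Reinforcement Learning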

4 Experiments

We evaluate our approach on objectworld, a particularly challenging environment with a large number of irrelevant features and highly nonlinear reward functions. We use the expected value difference (EVD) as the metric of optimality:

$$\mathrm{EVD}(\mathcal{W}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi^*\right] - \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi(\mathcal{W})\right],$$

which measures the difference between the expected reward earned under the optimal policy $\pi^*$, given by the true reward, and the expected reward earned under the policy derived from rewards sampled from our reward conditional probability distribution $\mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta^*)$. Note that we use $\Theta^*$ to denote the best estimated parameter in our approach.
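The EVD can be computed by evaluating both policies under the true reward. The sketch below is a simplified illustration: it uses iterative policy evaluation, deterministic policies, and a single fixed start state rather than an expectation over start states. The function names and the two-state toy chain are hypothetical.

```python
def policy_value(policy, transition, reward, gamma, start, n_sweeps=2000):
    """Expected discounted return of a deterministic policy from `start`,
    computed by iterative policy evaluation with a state-only reward."""
    n = len(reward)
    values = [0.0] * n
    for _ in range(n_sweeps):
        values = [
            reward[s] + gamma * sum(p * values[s2] for s2, p in transition[s][policy[s]])
            for s in range(n)
        ]
    return values[start]


def expected_value_difference(optimal_policy, learned_policy, transition,
                              true_reward, gamma, start):
    """EVD: both policies are scored under the TRUE reward."""
    return (policy_value(optimal_policy, transition, true_reward, gamma, start)
            - policy_value(learned_policy, transition, true_reward, gamma, start))


# Hypothetical 2-state chain: action 0 stays, action 1 moves to the other state.
transition = [
    [[(0, 1.0)], [(1, 1.0)]],
    [[(1, 1.0)], [(0, 1.0)]],
]
true_reward = [0.0, 1.0]
optimal = [1, 0]   # go to state 1, then stay
learned = [0, 0]   # always stay, e.g. a policy induced by a poor reward estimate
evd = expected_value_difference(optimal, learned, transition, true_reward, 0.9, start=0)
print(evd)
```

A policy that matches the optimal one has an EVD of zero, and larger values indicate a worse recovered reward.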

4.1 Objectworld

The objectworld is an IRL environment proposed by Levine et al. [5]. It is an $N \times N$ grid board with colored objects placed in randomly selected cells. Each colored object is assigned one inner color and one outer color from $C$ preselected colors. Each cell on the grid board is a state, and the five actions are stepping to one of the four neighboring cells (up, down, left, right) or staying in place (stay), each with a 30% chance of moving in a random direction instead.

The ground truth reward function is defined as follows. Suppose two primary colors of the $C$ preselected colors are red and blue. The reward of a state is 1 if the state is within 3 steps of an outer red object and within 2 steps of an outer blue object, -1 if the state is within 3 steps of an outer red object only, and 0 otherwise. The other pairs of inner and outer colors are distractors. Continuous and discrete versions of the feature basis functions are provided. For the continuous version, $\phi(s)$ is a $2C$-dimensional real-valued feature vector, in which each dimension records the Euclidean distance from the state to objects. For example, the first and second coordinates are the distances to the nearest inner and outer red object respectively, and so on through all $C$ colors. For the discrete version, $\phi(s)$ is a $(2C \cdot N)$-dimensional binary feature vector. Each $N$-dimensional block is a binary representation of the distance to the nearest object with a given inner or outer color, with the $d$-th coordinate equal to 1 if the corresponding continuous distance is less than $d$.
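The two feature encodings can be sketched as follows. The tuple layout `(x, y, inner_color, outer_color)` for objects, the use of infinity when no object of a color exists, and the function names are assumptions made for this illustration, not details of the original environment code.

```python
import math


def continuous_features(state, objects, colors):
    """2C distances: nearest inner-color and outer-color object for each color.

    `objects` is a list of (x, y, inner_color, outer_color) tuples.
    """
    feats = []
    for c in colors:
        for slot in (2, 3):  # index 2 = inner color, index 3 = outer color
            dists = [
                math.hypot(state[0] - o[0], state[1] - o[1])
                for o in objects if o[slot] == c
            ]
            feats.append(min(dists) if dists else math.inf)
    return feats


def discrete_features(cont_feats, grid_size):
    """2*C*N binary features: the d-th bit is 1 when the distance is less than d."""
    return [
        1 if dist < d else 0
        for dist in cont_feats
        for d in range(1, grid_size + 1)
    ]


# One object at (3, 4) with inner color red and outer color blue.
objects = [(3, 4, "red", "blue")]
cont = continuous_features((0, 0), objects, colors=["red", "blue"])
disc = discrete_features(cont, grid_size=10)
print(cont)
print(disc[:10])
```

The discrete encoding is a thermometer code of the distance: once a bit turns on at some threshold, every later bit in the same block is on as well.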

4.2 Evaluation Procedure and Analysis

In this section, we design several tasks to evaluate our generative model, the reward conditional probability distribution $\mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta^*)$. For each task, the environment setting is as follows. The $10 \times 10$ objectworld instance has 25 random objects with 2 colors and a discount factor of 0.9. For the recovery, 200 expert demonstrations are generated according to the true optimal policy. The trajectory length of each expert demonstration is 5 grid sizes. We evaluate four algorithms: MaxEnt, DeepMaxEnt, SIRL, and DSIRL. SIRL and DSIRL are implemented as in Algorithm 1 under the assumption of a linear and a nonlinear structure of reward functions respectively, i.e., the weights drawn from the reward conditional probability distribution compose the coefficients of a linear or nonlinear combination of feature functions.

In our evaluation, SIRL and DSIRL start from 10 samples and double the sample size per iteration until they converge. In the first stage, the number of epochs per algorithm iteration is set to 20 and the learning rate is 0.01. The parameter $\epsilon$ in the representative trajectory class $\mathcal{C}_\epsilon^E$ is preset to 0.95. In the second stage, the GMM in both SIRL and DSIRL has three components and runs for at most 1000 iterations before convergence. Additionally, the neural networks for DeepMaxEnt and DSIRL are both implemented as a 3-layer fully-connected architecture with the sigmoid function as the activation function.

4.2.1 Evaluation Platform

All the methods are implemented in Python 3.5 and Theano 1.0.0 with a machine learning distributed framework, Ray [18]. The experiments are conducted on a machine with Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz and Nvidia GeForce GTX 1070 GPU.

4.2.2 Recovery Experiment

In this experiment, we compare the ground truth of the reward function, the optimal policy, and the optimal value with the ones derived from the four algorithms on objectworld. For SIRL and DSIRL, the mean of the reward conditional probability distribution is used for the comparison. In Figure 1, the mean of DSIRL performs better than DeepMaxEnt, and the mean of SIRL performs better than MaxEnt, because of the Monte Carlo mechanism, a global search approach, in our algorithm. Both MaxEnt and DeepMaxEnt are very likely to get trapped in a secondary maximum. Because the ground truth is highly nonlinear, the mean of DSIRL beats the mean of SIRL in this task. Our generalized perspective can be regarded as an average over all outcomes, which tends to imitate the true optimal value from finite expert demonstrations.

Figure 1: Results for the recovery experiment. The EVD values (the differences of the optimal values in the last row) for the four algorithms are 48.9, 31.1, 33.7, and 11.3, respectively. The covariances of the GMM for SIRL and DSIRL are at most 5.53 and 1.36 on each coordinate, respectively.

4.2.3 Generative Experiment

In this experiment, we evaluate the generative ability of the reward conditional probability distribution $\mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta^*)$, which can generate more than one reward function as a solution to the IRL problem. Each optimal value derived from a drawn reward has a small EVD value with respect to the true original reward. We design a generative algorithm to capture robust solutions. The pseudocode of the generative algorithm is given in Algorithm 2, and the experimental results are shown in Figure 2.

In the generative algorithm, we use the Frobenius norm to measure the distance between weights (matrices) drawn from the reward conditional probability distribution, given by

$$\|\mathcal{W}\| := \sqrt{\mathrm{Tr}(\mathcal{W} \mathcal{W}^T)}.$$

Each drawn weight $\mathcal{W} \sim \mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta^*)$ in the solution set $\mathcal{G}$ should satisfy

$$\|\mathcal{W} - \mathcal{W}'\| > \delta \text{ and } \mathrm{EVD}(\mathcal{W}) < \epsilon,$$

where $\mathcal{W}'$ represents any other member of the solution set $\mathcal{G}$, and $\delta, \epsilon$ are the preset thresholds in the generative algorithm.

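Input: $\mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta^*)$, required solution set size $N$, and preset thresholds $\epsilon$ and $\delta$.
Output: Solution set $\mathcal{G} := \{\mathcal{W}_i\}_{i=1}^{N}$.
while $i < N$ do
    Draw $\mathcal{W} \sim \mathcal{D}(\mathcal{W} \mid \zeta^E, \Theta^*)$.
    if $\|\mathcal{W} - \mathcal{W}'\| > \delta$ for every $\mathcal{W}' \in \mathcal{G}$ and $\mathrm{EVD}(\mathcal{W}) < \epsilon$ then
        $\mathcal{G} \leftarrow \mathcal{G} \cup \{\mathcal{W}\}$
    end
end
Algorithm 2: Generative Algorithm
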
Figure 2: Results for the generative experiment. The right four columns are generated from four drawn weights in the solution set $\mathcal{G}$ with EVD values around 10.2. Each recovered reward function in the first row shows a different pattern, which implies that our generative model can generate more than one robust solution to an IRL problem.
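The generative algorithm amounts to rejection sampling, which can be sketched as below. The sampler and the EVD function are passed in as callables; in the example they are replaced by hypothetical stand-ins (a two-dimensional Gaussian and a made-up EVD proxy), and the `max_draws` cap is our own safeguard against an unsatisfiable threshold.

```python
import math
import random


def frobenius_distance(a, b):
    """Frobenius norm of the difference of two flattened weight arrays."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_solution_set(sample_weight, evd, n_solutions, delta, epsilon, max_draws=10000):
    """Keep draws with low EVD that are delta-separated from all kept draws."""
    solutions = []
    for _ in range(max_draws):
        if len(solutions) >= n_solutions:
            break
        w = sample_weight()
        separated = all(frobenius_distance(w, v) > delta for v in solutions)
        if separated and evd(w) < epsilon:
            solutions.append(w)
    return solutions


rng = random.Random(0)
sample_weight = lambda: [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]  # stand-in sampler
evd_proxy = lambda w: abs(w[0])                                     # stand-in for EVD
solutions = build_solution_set(sample_weight, evd_proxy, n_solutions=4,
                               delta=0.3, epsilon=1.0)
print(len(solutions))
```

The separation threshold $\delta$ keeps the solution set diverse, while the EVD threshold $\epsilon$ keeps every member close to optimal.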

4.2.4 Hyperparameter Experiment

In this experiment, we evaluate the effectiveness of our approach under varying quantities and qualities of expert demonstrations. The amount of information carried in the expert demonstrations composes a specific learning environment and hence has an impact on the effectiveness of our generative model. Due to the page limit, we only examine three hyperparameters on objectworld: the number of expert demonstrations in Figure 3, the trajectory length of expert demonstrations in Figure 4, and the portion size in the representative trajectory class $\mathcal{C}_\epsilon^E$ in Figure 5. The shaded region around each line in the figures represents the standard error over the experimental trials. Notice that the EVDs for SIRL and DSIRL both decrease as the number and the trajectory length of expert demonstrations, and the portion size in the representative trajectory class, increase.

Figure 3: Results under 40, 80, 160, 320, 640, 1280 and 2560 expert demonstrations.
Figure 4: Results under 1, 2, 4, 8, 16, 32, and 64 grid size trajectory length of expert demonstrations.
Figure 5: Results under 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, and 0.95 portion size in 𝒞ϵE.

5 Conclusion

In this paper, we propose a generalized perspective on the IRL problem called the stochastic inverse reinforcement learning problem. We formulate it as an expectation optimization problem and adopt the MCEM method to give the first solution to it. The solution to SIRL gives a generative model that produces more than one reward function for the original IRL problem, making it possible to analyze and display a highly nonlinear IRL problem from a global viewpoint. The experimental results demonstrate the recovery and generative ability of the generative model under the comparison metric EVD. We also show the effectiveness of our model under varying hyperparameters of the expert demonstrations.

References

  • [1] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000.
  • [2] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
  • [3] Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pages 729–736. ACM, 2006.
  • [4] Pieter Abbeel, Dmitri Dolgov, Andrew Y Ng, and Sebastian Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1083–1090. IEEE, 2008.
  • [5] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.
  • [6] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
  • [7] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.
  • [8] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, volume 7, pages 2586–2591, 2007.
  • [9] Greg CG Wei and Martin A Tanner. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704, 1990.
  • [10] Henrik Kretzschmar, Markus Spies, Christoph Sprunk, and Wolfram Burgard. Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research, 35(11):1289–1307, 2016.
  • [11] Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in neural information processing systems, pages 1–8, 2007.
  • [12] Dizan Vasquez, Billy Okal, and Kai O Arras. Inverse reinforcement learning algorithms and features for robot navigation in crowds: an experimental comparison. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1341–1346. IEEE, 2014.
  • [13] William Kruskal and Frederick Mosteller. Representative sampling, iii: The current statistical literature. International Statistical Review/Revue Internationale de Statistique, pages 245–265, 1979.
  • [14] James G Booth and James P Hobert. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1):265–285, 1999.
  • [15] Brian S Caffo, Wolfgang Jank, and Galin L Jones. Ascent-based Monte Carlo expectation–maximization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):235–251, 2005.
  • [16] KS Chan and Johannes Ledolter. Monte Carlo EM estimation for time series models involving counts. Journal of the American Statistical Association, 90(429):242–252, 1995.
  • [17] Gersende Fort, Eric Moulines, et al. Convergence of the Monte Carlo expectation maximization for curved exponential families. The Annals of Statistics, 31(4):1220–1259, 2003.
  • [18] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018.