Abstract
Inverse reinforcement learning (IRL) is an illposed inverse problem sinceexpert demonstrations may infer many solutions of reward functions which ishard to recover by local search methods such as a gradient method. In thispaper, we generalize the original IRL problem to recover a probabilitydistribution for reward functions. We call such a generalized problemstochastic inverse reinforcement learning (SIRL) which is first formulated asan expectation optimization problem. We adopt the Monte Carloexpectationmaximization (MCEM) method, a global search method, to estimate theparameter of the probability distribution as the first solution to SIRL. Withour approach, it is possible to observe the deep intrinsic property in IRL froma global viewpoint, and the technique achieves a considerable robust recoveryperformance on the classic learning environment, objectworld.
Quick Read (beta)
Stochastic Inverse Reinforcement Learning
Abstract
Inverse reinforcement learning (IRL) is an illposed inverse problem since expert demonstrations may infer many solutions of reward functions which is hard to recover by local search methods such as a gradient method. In this paper, we generalize the original IRL problem to recover a probability distribution for reward functions. We call such a generalized problem stochastic inverse reinforcement learning (SIRL) which is first formulated as an expectation optimization problem. We adopt the Monte Carlo expectationmaximization (MCEM) method, a global search method, to estimate the parameter of the probability distribution as the first solution to SIRL. With our approach, it is possible to observe the deep intrinsic property in IRL from a global viewpoint, and the technique achieves a considerable robust recovery performance on the classic learning environment, objectworld.
Stochastic Inverse Reinforcement Learning
Ce Ju WeBank Co., Ltd. Shenzhen, P.R. China [email protected] Dong Eui Chang School of Electrical Engineering Korea Advanced Institute of Science and Technology Daejeon, South Korea [email protected]
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
IRL addresses an illposed inverse problem that expert demonstrations may yield a reward function in a Markov decision process (MDP) [1]. The recovered reward function will quantify how good or bad the certain actions are. With the knowledge of reward function, agents can perform better. However, not all reward functions will provide a succinct, robust, and transferable definition of the learning task, and a policy may be optimal for many reward functions [1]. For example, any given policy is optimal for the constant reward function in an MDP. Thus, the core of IRL is to explore the regular structure of reward functions.
Many existing methods impose the regular structures of reward functions in a combination of handselected features. Formally, a set of realvalued feature functions ${\{{\varphi}_{i}(s)\}}_{i=1}^{M}$ is given by a handselection of experts as basis functions of the reward function space in an MDP. Then, the reward function is approximated by a linear or nonlinear combination of the handselected feature functions ${\varphi}_{i}(s)$. The goal of this approach is to find the bestfitting weights of the feature functions in reward functions. One novel framework in this approach is to formulate the IRL problem as a numerical optimization problem [1; 2; 3; 4], and the other is based on maximizing a posteriori in a probabilistic approach [5; 6; 7; 8].
In this paper, we propose a generalized perspective of studying the IRL problem called stochastic inverse reinforcement learning (SIRL) which is formulated as an expectation optimization problem aiming to recover a probability distribution of the reward function from expert demonstrations. Typically, the solution to classic IRL is not always bestfitting because a highly nonlinear inverse problem with limited information from a collection of expert behavior is very likely to get trapped in a secondary maximum for a partially observable system. Thus, we employ the MCEM approach [9], which adopts a Monte Carlo mechanism for exhaustive search as a global search method, to give the first solution to the SIRL problem in a modelbased environment, and then we obtain the desired reward conditional probability distribution which can generate more than one weight for reward feature functions as composing alternative solutions to IRL problem. The most benefit of our generalized perspective gives a method that allows analysis and display of any given highly nonlinear IRL problem with a large collection of pseudorandomly generated local likelihood maxima. In view of the successful application of IRL in imitation learning and apprenticeship learning in the industry [4; 10; 11; 12], our generalized method demonstrates a great potential practical value.
2 Preliminary
An MDP is defined as a tuple $\mathcal{M}:=(\mathcal{S},\mathcal{A},\mathcal{T},R,\gamma )$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, and the transition function $\mathcal{T}:=\mathbb{P}({s}_{t+1}={s}^{\prime}{s}_{t}=s,{a}_{t}=a)$ for $s,{s}^{\prime}\in \mathcal{S}$ and $a\in \mathcal{A}$ records the probability of being current state $s$, taking action $a$ and yielding next state ${s}^{\prime}$. Reward $\mathcal{R}(s)$ is a realvalued function and $\gamma \in [0,1)$ is the discount factor. A policy $\pi $, which is a map from states to actions, has two formulations. The stochastic policy refers to a conditional distribution $\pi (as)$, and the deterministic policy is represented by a deterministic function $a=\pi (s)$. Sequential decisions are recorded in a series of episodes which consist of states, actions, and rewards. The goal of a reinforcement learning task is to find the optimal policy ${\pi}^{*}$ that optimizes the expected total reward $\mathbb{E}\left[{\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}\cdot \mathcal{R}({s}_{t})\pi \right]$. In an IRL problem setting, we have an MDP without a reward function, denoted as an MDP$\backslash $R, and a collection of expert demonstrations ${\zeta}^{E}:=\{{\zeta}^{1},\mathrm{\cdots},{\zeta}^{m}\}$. Each demonstration ${\zeta}^{i}$ consists of sequential stateaction pairs representing the behavior of an expert. The goal of IRL problem is to estimate the unknown reward function $\mathcal{R}(s)$ from expert demonstrations for an MDP$\backslash $R. The learned complete MDP yields an optimal policy that acts as closely as the expert demonstrations.
2.1 MaxEnt and DeepMaxEnt
In this section, we provide a small overview of the probabilistic approach for the IRL problem under two existing kinds of the regular structure on reward functions. One kind is of a linear structure on the reward function. A reward function is always written as $\mathcal{R}(s)={\sum}_{i}^{M}{\alpha}_{i}\cdot {\varphi}_{i}(s)$, where ${\varphi}_{i}:\mathcal{S}\to {\mathbb{R}}^{d}$ are a $d$dimensional feature functions given by a handselection of experts [1]. Ziebart et al. [6] propose a probabilistic approach dealing with the ambiguity of the IRL problem based on the principle of maximum entropy, which is called Maximum entropy IRL (MaxEnt). In MaxEnt, we always assume that trajectories with higher rewards are exponentially more preferred $\mathbb{P}(\zeta R)\propto \mathrm{exp}{\sum}_{s\in \zeta}R(s)$, where $\zeta $ is one trajectory from expert demonstrations. The objective function for MaxEnt is derived from maximizing the likelihood of expert trajectory under the maximum entropy, and it is always convex for deterministic MDPs. Typically, the optimum is obtained by a gradientbased method. The other kind is of a nonlinear structure on the reward function which is of the form $\mathcal{R}(s)=\mathcal{F}({\varphi}_{1}(s),\mathrm{\cdots},{\varphi}_{M}(s))$, where $\mathcal{F}$ is a nonlinear function of feature basis ${\{{\varphi}_{i}(s)\}}_{i=1}^{M}$. Following the principle of maximum entropy, Wulfmeier et al. [7] extend MaxEnt by adopting a neural networkbased approach approximating the unknown nonlinear reward, which is called Maximum entropy deep IRL (DeepMaxEnt).
To generalize the classic regular structures on the reward function, we propose a stochastic regular structure on the reward function in the following section.
2.2 Problem Statement
Formally, we are given an MDP$\backslash $R$=(\mathcal{S},\mathcal{A},\mathcal{T},\gamma )$ with a known transition function $\mathcal{T}=\mathbb{P}({s}_{t+1}={s}^{\prime}{s}_{t}=s,{a}_{t}=a)$ for $s,{s}^{\prime}\in \mathcal{S}$ and $a\in \mathcal{A}$ and a handcrafted reward feature basis ${\{{\varphi}_{i}(s)\}}_{i=1}^{M}$. A stochastic regular structure on the reward function assumes weights $\mathcal{W}$ of the reward feature functions ${\varphi}_{i}(s)$, which are random variables with a reward conditional probability distribution ${\mathcal{D}}^{\mathcal{R}}(\mathcal{W}{\zeta}^{E})$ conditional on expert demonstrations ${\zeta}^{E}$. Parametrizing ${\mathcal{D}}^{\mathcal{R}}(\mathcal{W}{\zeta}^{E})$ with parameter $\mathrm{\Theta}$, our aim is to estimate the bestfitting parameter ${\mathrm{\Theta}}^{*}$ from the expert demonstrations ${\zeta}^{E}$, such that ${\mathcal{D}}^{\mathcal{R}}(\mathcal{W}{\zeta}^{E},{\mathrm{\Theta}}^{*})$ more likely generates weights to compose reward functions as the ones derived from expert demonstrations, which is called stochastic inverse reinforcement learning problem.
In practice, expert demonstration ${\zeta}^{E}$ can be observed but lack of sampling representativeness [13]. For example, one driver’s demonstrations encode his own preferences in driving style but may not reflect the true rewards of an environment. To overcome this limitation, we introduce a representative trajectory class ${\mathcal{C}}_{\u03f5}^{E}$ such that each trajectory element set $\mathcal{O}\in {\mathcal{C}}_{\u03f5}^{E}$ is a subset of expert demonstrations ${\zeta}^{E}$ with the cardinality at least $\u03f5\cdot {\zeta}^{E}$, where $\u03f5$ is a preset threshold and ${\zeta}^{E}$ is the number of expert demonstrations, and it is written as ${\mathcal{C}}_{\u03f5}^{E}:=\left\{\mathcal{O}\right\mathcal{O}\subset {\zeta}^{E}\text{with}\mathcal{O}\ge \u03f5\cdot {\zeta}^{E}\}$.
We integrate out unobserved weights $\mathcal{W}$, and then SIRL problem is formulated to estimate parameter $\mathrm{\Theta}$ on an expectation optimization problem over the representative trajectory class as follows:
${\mathrm{\Theta}}^{*}:=\mathrm{arg}\underset{\mathrm{\Theta}}{\mathrm{max}}{\mathbb{E}}_{\mathcal{O}\in {\mathcal{C}}_{\u03f5}^{E}}\left[{\displaystyle {\int}_{\mathcal{W}}}{f}_{\mathcal{M}}(\mathcal{O},\mathcal{W}\mathrm{\Theta})\mathit{d}\mathcal{W}\right],$  (1) 
where trajectory element set $\mathcal{O}$ assumes to be uniformly distributed for the sake of simplicity in this study but usually known from the rough estimation of the statistics in expert demonstrations in practice, and ${f}_{\mathcal{M}}$ is the conditional joint probability density function of trajectory element $\mathcal{O}$ and weights $\mathcal{W}$ for reward feature functions conditional on parameter $\mathrm{\Theta}$.
In the following section, we propose a novel approach to estimate the bestfitting parameter ${\mathrm{\Theta}}^{*}$ in Equation 1, which is called the twostage hierarchical method, a variant of MCEM method.
3 Methodology
3.1 Twostage Hierarchical Method
The twostage hierarchical method requires us to write parameter $\mathrm{\Theta}$ in a profile form $\mathrm{\Theta}:=({\mathrm{\Theta}}_{1},{\mathrm{\Theta}}_{2})$. The conditional joint density ${f}_{\mathcal{M}}(\mathcal{O},\mathcal{W}\mathrm{\Theta})$ in Equation 1 can be written as the product of two conditional densities ${g}_{\mathcal{M}}$ and ${h}_{\mathcal{M}}$ as follows:
${f}_{\mathcal{M}}(\mathcal{O},\mathcal{W}{\mathrm{\Theta}}_{1},{\mathrm{\Theta}}_{2})={g}_{\mathcal{M}}(\mathcal{O}\mathcal{W},{\mathrm{\Theta}}_{1})\cdot {h}_{\mathcal{M}}(\mathcal{W}{\mathrm{\Theta}}_{2}).$  (2) 
Take the log of both sides in Equation 2, and we have
$\mathrm{log}{f}_{\mathcal{M}}(\mathcal{O},\mathcal{W}{\mathrm{\Theta}}_{1},{\mathrm{\Theta}}_{2})=\mathrm{log}{g}_{\mathcal{M}}(\mathcal{O}\mathcal{W},{\mathrm{\Theta}}_{1})+\mathrm{log}{h}_{\mathcal{M}}(\mathcal{W}{\mathrm{\Theta}}_{2}).$  (3) 
We optimize the right side of Equation 3 over the profile parameter $\mathrm{\Theta}$ in the expectationmaximization (EM) update steps at the $t$th iteration independently as follows,
${\mathrm{\Theta}}_{1}^{t+1}:$  $=\mathrm{arg}\underset{{\mathrm{\Theta}}_{1}}{\mathrm{max}}\mathbb{E}\left(\mathrm{log}{g}_{\mathcal{M}}(\mathcal{O}\mathcal{W},{\mathrm{\Theta}}_{1}){\mathcal{C}}_{\u03f5}^{E},{\mathrm{\Theta}}^{t}\right);$  (4)  
${\mathrm{\Theta}}_{2}^{t+1}:$  $=\mathrm{arg}\underset{{\mathrm{\Theta}}_{2}}{\mathrm{max}}\mathbb{E}\left(\mathrm{log}{h}_{\mathcal{M}}(\mathcal{W}{\mathrm{\Theta}}_{2}){\mathrm{\Theta}}^{t}\right).$  (5) 
3.1.1 Initialization
We randomly initialize profile parameter ${\mathrm{\Theta}}^{0}:=({\mathrm{\Theta}}_{1}^{0},{\mathrm{\Theta}}_{2}^{0})$ and sample a collection of ${N}_{0}$ rewards weights $\{{\mathcal{W}}_{1}^{{\mathrm{\Theta}}^{0}},\mathrm{\cdots},{\mathcal{W}}_{N}^{{\mathrm{\Theta}}^{0}}\}\sim {\mathcal{D}}^{\mathcal{R}}(\mathcal{W}{\zeta}^{E},{\mathrm{\Theta}}_{2}^{0})$. The reward weights ${\mathcal{W}}_{i}^{{\mathrm{\Theta}}^{0}}$ compose reward ${R}_{{\mathcal{W}}_{i}^{{\mathrm{\Theta}}^{0}}}$ in each learning task ${\mathcal{M}}_{i}^{0}:=(\mathcal{S},\mathcal{A},\mathcal{T},{R}_{{\mathcal{W}}_{i}^{{\mathrm{\Theta}}^{0}}},\gamma )$ for $i=1\mathrm{\cdots}{N}_{0}$.
3.1.2 First Stage
In the first stage, we aim to update parameter ${\mathrm{\Theta}}_{1}$ for the intractable expectation in Equation 4 in each iteration. Specifically, we take a Monte Carlo method to estimate model parameters ${\mathrm{\Theta}}_{1}^{t+1}$ in an empirical expectation at the $t$th iteration,
$\mathcal{E}\left[\mathrm{log}{g}_{\mathcal{M}}(\mathcal{O}\mathcal{W},{\mathrm{\Theta}}_{1}^{t+1}){\mathcal{C}}_{\u03f5}^{E},{\mathrm{\Theta}}^{t}\right]:={\displaystyle \frac{1}{{N}_{t}}}\cdot {\displaystyle \sum _{i=1}^{{N}_{t}}}\mathrm{log}{g}_{{\mathcal{M}}_{i}^{t}}({\mathcal{O}}_{i}^{t}{\mathcal{W}}_{i}^{{\mathrm{\Theta}}^{t}},{\mathrm{\Theta}}_{1}^{t+1}),$  (6) 
where reward weights at the $t$th iteration ${\mathcal{W}}_{i}^{{\mathrm{\Theta}}^{t}}$ are randomly drawn from the reward conditional probability distribution ${\mathcal{D}}^{\mathcal{R}}(\mathcal{W}{\zeta}^{E},{\mathrm{\Theta}}^{t})$ and compose a set of learning tasks ${\mathcal{M}}_{i}^{t}:=(\mathcal{S},\mathcal{A},\mathcal{T},{R}_{{\mathcal{W}}_{i}^{{\mathrm{\Theta}}^{t}}},\gamma )$ with a trajectory element set ${\mathcal{O}}_{i}^{t}$ uniformly drawn from representative trajectory class ${\mathcal{C}}_{\u03f5}^{E}$, for $i=1,\mathrm{\cdots},{N}_{t}$.
The parameter ${\mathrm{\Theta}}_{1}^{t+1}$ in Equation 6 has ${N}_{t}$ coordinates written as ${\mathrm{\Theta}}_{1}^{t+1}:=({({\mathrm{\Theta}}_{1}^{t+1})}_{1},\mathrm{\cdots},{({\mathrm{\Theta}}_{1}^{t+1})}_{{N}_{t}})$. For each learning task ${\mathcal{M}}_{i}^{t}$, the $i$th coordinate ${({\mathrm{\Theta}}_{1}^{t+1})}_{i}$ is derived from maximization of a posteriori, the same trick as the ones in MaxEnt and DeepMaxEnt [6; 7], as follows:
$${({\mathrm{\Theta}}_{1}^{t+1})}_{i}:=\mathrm{arg}\underset{\theta}{\mathrm{max}}\mathrm{log}{g}_{{\mathcal{M}}_{i}^{t}}\left({\mathcal{O}}_{i}^{t}{\mathcal{W}}_{i}^{{\mathrm{\Theta}}^{t}},\theta \right),$$ 
which is a convex formulation optimized in a gradient ascent method.
In practice, we move $m$ steps uphill to the optimum in each learning task ${\mathcal{M}}_{i}^{t}$. The update formula of $m$step reward weights $\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}$ is written as
$$\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}:={\mathcal{W}}_{i}^{{\mathrm{\Theta}}^{t}}+\sum _{i=1}^{m}{\lambda}_{i}^{t}\cdot {\nabla}_{{({\mathrm{\Theta}}_{1})}_{i}}\mathrm{log}{g}_{{\mathcal{M}}_{i}^{t}}\left({\mathcal{O}}_{i}^{t}{\mathcal{W}}_{i}^{{\mathrm{\Theta}}^{t}},{({\mathrm{\Theta}}_{1})}_{i}\right),$$ 
where the learning rate ${\lambda}_{i}^{t}$ at the $t$th iteration is preset. Hence, the parameter ${\mathrm{\Theta}}_{1}^{t+1}$ in practice is represented as ${\mathrm{\Theta}}_{1}^{t+1}:=(\mathcal{W}^{m}{}_{1}{}^{{\mathrm{\Theta}}^{t}},\mathrm{\cdots},\mathcal{W}^{m}{}_{{N}_{t}}{}^{{\mathrm{\Theta}}^{t}})$.
3.1.3 Second Stage
In the second stage, we aim to update parameter ${\mathrm{\Theta}}_{2}$ for the intractable expectation in Equation 5 in each iteration. Specifically, we consider the empirical expectation at the $t$th iteration as follows,
$\mathcal{E}\left(\mathrm{log}{h}_{\mathcal{M}}(\mathcal{W}{\mathrm{\Theta}}_{2}^{t+1}){\mathrm{\Theta}}^{t}\right):={\displaystyle \frac{1}{{N}_{t}}}\cdot {\displaystyle \sum _{i=1}^{{N}_{t}}}\mathrm{log}{h}_{{\mathcal{M}}_{i}^{t}}(\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}{\mathrm{\Theta}}_{2}^{t+1}).$  (7) 
where ${h}_{\mathcal{M}}$ is implicit but fitting a set of $m$step reward weights ${\{\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}\}}_{i=1}^{{N}_{t}}$ in a generative model yields a large empirical expectation value. The reward conditional probability distribution ${\mathcal{D}}^{\mathcal{R}}(\mathcal{W}{\zeta}^{E},{\mathrm{\Theta}}_{2}^{t+1})$ is the generative model formulated in a Gaussian Mixture Model (GMM) in practice, i.e.
$${\mathcal{D}}^{\mathcal{R}}(\mathcal{W}{\zeta}^{E},{\mathrm{\Theta}}_{2}^{t+1}):=\sum _{k=1}^{K}{\alpha}_{k}\cdot \mathcal{N}(\mathcal{W}{\mu}_{k},{\mathrm{\Sigma}}_{k})$$ 
with ${\alpha}_{k}\ge 0$, ${\sum}_{k=1}^{K}{\alpha}_{k}=1$, and parameter set ${\mathrm{\Theta}}_{2}^{t+1}:={\{{\alpha}_{k};{\mu}_{k},{\mathrm{\Sigma}}_{k}\}}_{k=1}^{K}$.
We estimate parameter ${\mathrm{\Theta}}_{2}^{t+1}$ in GMM via an EM approach and initialize GMM with the $t$th iteration parameter ${\mathrm{\Theta}}_{2}^{t}$. The EM procedures are given as follows: for $i=1,\mathrm{\cdots},{N}_{t}$,

•
Expectation Step: Compute responsibility ${\gamma}_{ij}$ for $m$step reward weight $\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}$,
$${\gamma}_{ij}:=\frac{{\alpha}_{j}\cdot \mathcal{N}(\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}{\mu}_{j},{\mathrm{\Sigma}}_{j})}{{\sum}_{k=1}^{K}{\alpha}_{k}\cdot \mathcal{N}(\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}{\mu}_{k},{\mathrm{\Sigma}}_{k})}.$$ 
•
Maximization Step: Compute weighted mean ${\mu}_{j}$ and variance ${\mathrm{\Sigma}}_{j}$,
${\mu}_{j}:$ $={\displaystyle \frac{{\sum}_{i=1}^{{N}_{t}}{\gamma}_{ij}\cdot \mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}}{{\sum}_{i=1}^{{N}_{t}}{\gamma}_{ij}}};{\alpha}_{j}:={\displaystyle \frac{1}{{N}_{t}}}\cdot {\displaystyle \sum _{i=1}^{{N}_{t}}}{\gamma}_{ij};$ ${\mathrm{\Sigma}}_{j}:$ $={\displaystyle \frac{{\sum}_{i=1}^{{N}_{t}}{\gamma}_{ij}\cdot (\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}{\mu}_{j})\cdot {(\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}{\mu}_{j})}^{T}}{{\sum}_{i=1}^{{N}_{t}}{\gamma}_{ij}}}.$
After the EM converges, parameter ${\mathrm{\Theta}}_{2}^{t+1}:={\{{\alpha}_{k};{\mu}_{k},{\mathrm{\Sigma}}_{k}\}}_{k=1}^{K}$ of GMM in this iteration, and profile parameter ${\mathrm{\Theta}}^{t+1}:=({\mathrm{\Theta}}_{1}^{t+1},{\mathrm{\Theta}}_{2}^{t+1})$.
Finally, when the twostage hierarchical method converges, our desired bestfitting parameter ${\mathrm{\Theta}}^{*}$ in ${\mathcal{D}}^{\mathcal{R}}(\mathcal{W}{\zeta}^{E},{\mathrm{\Theta}}^{*})$ is parameter ${\mathrm{\Theta}}_{2}$ in profile parameter $\mathrm{\Theta}$.
3.2 Termination Criteria
In this section, we will talk about the termination criteria in our algorithm. An ordinary EM algorithm terminates usually when the parameters do not substantively change after enough iterations. For example, one classic termination criterion in the EM algorithm terminates at the $t$th iteration satisfying,
$$ 
for userspecified ${\delta}_{EM}$ and ${\u03f5}_{EM}$, where $\theta $ is the model parameter in the EM algorithm.
However, such a termination criterion in MCEM has the risk of terminating early because of the Monte Carlo error in the update step. Hence, we adopt a practical method in which the following similar stopping criterion holds in three consecutive times,
$$ 
for userspecified ${\delta}_{MCEM}$ and ${\u03f5}_{MCEM}$ [14]. For various other stopping criteria of MCEM in the literature refers to [15; 16].
3.3 Convergence Issue
The convergence issue of MCEM is more complicated than ordinary EM. In light of modelbased interactive MDP$\backslash $R, we can always increase the sample size of MCEM per iteration. We require the Monte Carlo sample size per iteration in practice satisfy the following inequality,
$$ 
An additional requirement is that parameter space should be compact for the convergence property. For a comprehensive proof, refer to [16; 17].
A pseudocode of our approach is given in Algorithm 1.
${\mu}_{j}$  $\leftarrow {\displaystyle \frac{{\sum}_{i=1}^{{N}_{t}}{\gamma}_{ij}\cdot \mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}}{{\sum}_{i=1}^{{N}_{t}}{\gamma}_{ij}}};{\alpha}_{j}\leftarrow {\displaystyle \frac{1}{{N}_{t}}}\cdot {\displaystyle \sum _{i=1}^{{N}_{t}}}{\gamma}_{ij};$  
${\mathrm{\Sigma}}_{j}$  $\leftarrow {\displaystyle \frac{{\sum}_{i=1}^{{N}_{t}}{\gamma}_{ij}\cdot (\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}{\mu}_{j})\cdot {(\mathcal{W}^{m}{}_{i}{}^{{\mathrm{\Theta}}^{t}}{\mu}_{j})}^{T}}{{\sum}_{i=1}^{{N}_{t}}{\gamma}_{ij}}};$ 
4 Experiments
We evaluate our approach on an environment, objectworld, which is a particularly challenging environment with a large number of irrelevant features and the highly nonlinearity of the reward functions. We employ the expected value difference (EVD) to be the metric of optimality as follows:
$$\text{EVD}(\mathcal{W}):=\mathbb{E}\left[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}\cdot R({s}_{t}){\pi}^{*}\right]\mathbb{E}\left[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}\cdot R({s}_{t})\pi (\mathcal{W})\right],$$ 
which is a measure of the difference between the expected reward earned under the optimal policy ${\pi}^{*}$, given by the true reward, and the policy derived from the rewards sampling from our reward conditional probability distribution $\mathcal{D}(\mathcal{W}{\zeta}^{E},{\mathrm{\Theta}}^{*})$. Notice that we use ${\mathrm{\Theta}}^{*}$ to denote the best estimation parameter in our approach.
4.1 Objectworld
The objectworld is an IRL environment proposed by Levine et al. [5] which is an $N\times N$ grid board with colored objects placed in randomly selected cells. Each colored object is assigned one inner color and one outer color from $C$ preselected colors. Each cell on the grid board is a state, and stepping to four neighbor cells (up, down, left, right) or staying in place (stay) are five actions with a 30% chance of moving in a random direction.
The ground truth of reward function is defined in the following way. Suppose two primary colors of $C$ preselected colors are red and blue. The reward of a state is 1 if the state is within 3 steps of an outer red object and 2 steps of an outer blue object, 1 if the state is within 3 steps of an outer red object, and 0 otherwise. The other pairs of inner and outer colors are distractors. Continuous and discrete versions of feature basis functions are provided. For the continuous version, $\varphi (s)$ is a $2C$dimensional realvalued feature vector. Each dimension records the Euclidean distance from the state to objects. For example, the first and second coordinates are the distances to the nearest inner and outer red object respectively, and so on through all $C$ colors. For the discrete version, $\varphi (s)$ is a $(2C\cdot N)$dimensional binary feature vector. Each $N$dimensional vector records a binary representation of distance to the nearest inner or outer color object with the $d$th coordinate 1 if the corresponding continuous distance is less than $d$.
4.2 Evaluation Procedure and Analysis
In this section, we design several tasks to evaluate our generative model, reward conditional probability distribution $\mathcal{D}(\mathcal{W}{\zeta}^{E},{\mathrm{\Theta}}^{*})$. For each task, the environment setting is as follows. The instance of $10\times 10$ objectworld has 25 random objects with 2 colors and 0.9 discount factor. 200 expert demonstrations are generated according to the true optimal policy for the recovery. The length of each expert demonstration is 5 grid size trajectory length. We have four algorithms in the evaluation including MaxEnt, DeepMaxEnt, SIRL, and DSIRL. SIRL and DSIRL are implemented as in Algorithm 1 with an assumption of the linear and nonlinear structure of reward functions respectively, i.e. the drawn weights from reward conditional probability distribution will compose the coefficients in a linear or nonlinear combination of feature functions.
In our evaluation, SIRL and DSIRL start from 10 samples and double the sample size per iteration until it converges. In the first stage, the epochs of algorithm iteration are set to 20 and the learning rates are 0.01. The parameter $\u03f5$ in representative trajectory set ${\mathcal{O}}_{\u03f5}^{E}$ is preset to 0.95. In the second stage, GMM in both SIRL and DSIRL has three components with at most 1000 iterations before convergence. Additionally, the neural networks for DeepMaxEnt and DSIRL are both implemented in a 3layer fullyconnected architecture with the sigmoid function as the activation function.
4.2.1 Evaluation Platform
All the methods are implemented in Python 3.5 and Theano 1.0.0 with a machine learning distributed framework, Ray [18]. The experiments are conducted on a machine with Intel(R) Core(TM) i77700 CPU @ 3.60GHz and Nvidia GeForce GTX 1070 GPU.
4.2.2 Recovery Experiment
In this experiment, we aim to compare the ground truth of the reward function, the optimal policy, and the optimal value with the ones derived from four algorithms on objectworld. For SIRL and DSIRL, the mean of reward conditional probability distribution is used as the comparison object. In Figure 1, we notice that the mean of DSIRL performs better than DeepMaxEnt, and the mean of SIRL is better than MaxEnt because of our Monte Carlo mechanism, a global search approach, in our algorithm. Both MaxEnt and DeepMaxEnt are very likely to get trapped in a secondary maximum. Because of highly nonlinear ground truth, the mean of DSIRL beats the mean of SIRL in this task. Our generalized perspective can be regarded as an average of all outcomes which is prone to imitate the true optimal value from finite expert demonstrations.
4.2.3 Generative Experiment
In this experiment, we aim to evaluate the generativeness of the reward conditional probability distribution $\mathcal{D}(\mathcal{W}{\zeta}^{E},{\mathrm{\Theta}}^{*})$ which will generate more than one reward function as the solution for the IRL problem. Each optimal value derived from the drawn reward has a small EVD value compared with the true original reward. We design an generative algorithm to capture robust solutions. The pseudocode of generative algorithm is given in Algorithm 2, and for experimental result refers to Figure 2.
In the generative algorithm, we always use the Frobenius norm to measure the distance between weights (matrices) drawn from the reward conditional probability distribution, given by
$${\mathcal{W}}_{\mathcal{F}}:=\sqrt{\mathrm{Tr}(\mathcal{W}\cdot {\mathcal{W}}^{T})}.$$ 
Each drawn weight $\mathcal{W}\sim \mathcal{D}(\mathcal{W}{\zeta}^{E},{\mathrm{\Theta}}^{*})$ in the solution set $\mathcal{G}$ should satisfy
$$ 
where ${\mathcal{W}}^{\prime}$ represents any other member in the solution set $\mathcal{G}$, and $\delta ,\u03f5$ are the preset thresholds in the generative algorithm.
4.2.4 Hyperparameter Experiment
In this experiment, we aim to evaluate the effectiveness of our approach under the influence of preset variant quantities and qualities of expert demonstrations. The amount of information carried in expert demonstrations will compose a specific learning environment, and hence has an impact on the effectiveness of our generative model. Due to page limit, we only verify three hyperparameters including the number of expert demonstrations in Figure 5, the trajectory length of expert demonstrations in Figure 5 and the portion size in representative trajectory class ${\mathcal{C}}_{\u03f5}^{E}$ in Figure 5 on objectworld. The Shadow of the line in the figures represents the standard error for each experimental trail. Notice that the EVDs for SIRL and DSIRL are both decreasing as the number and the trajectory length of expert demonstrations, and the portion size in the representative trajectory class are increasing.
5 Conclusion
In this paper, we propose a generalized perspective for IRL problem called stochastic inverse reinforcement learning problem. We formulate it as an expectation optimization problem and adopt the MCEM method to give the first solution to it. The solution to SIRL gives a generative model to produce more than one reward function for original IRL problem, making it possible to analyze and display highly nonlinear IRL problem from a global viewpoint. The experimental results demonstrate the recovery and generative ability of the generative model under the comparison metric EVD. We then show the effectiveness of our model under the influence of a set of hyperparameters of expert demonstrations.
References
 [1] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Icml, volume 1, page 2, 2000.
 [2] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twentyfirst international conference on Machine learning, page 1. ACM, 2004.
 [3] Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pages 729–736. ACM, 2006.
 [4] Pieter Abbeel, Dmitri Dolgov, Andrew Y Ng, and Sebastian Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1083–1090. IEEE, 2008.
 [5] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.
 [6] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
 [7] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.
 [8] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, volume 7, pages 2586–2591, 2007.
 [9] Greg CG Wei and Martin A Tanner. A monte carlo implementation of the em algorithm and the poor man’s data augmentation algorithms. Journal of the American statistical Association, 85(411):699–704, 1990.
 [10] Henrik Kretzschmar, Markus Spies, Christoph Sprunk, and Wolfram Burgard. Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research, 35(11):1289–1307, 2016.
 [11] Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in neural information processing systems, pages 1–8, 2007.
 [12] Dizan Vasquez, Billy Okal, and Kai O Arras. Inverse reinforcement learning algorithms and features for robot navigation in crowds: an experimental comparison. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1341–1346. IEEE, 2014.
 [13] William Kruskal and Frederick Mosteller. Representative sampling, iii: The current statistical literature. International Statistical Review/Revue Internationale de Statistique, pages 245–265, 1979.
 [14] James G Booth and James P Hobert. Maximizing generalized linear mixed model likelihoods with an automated monte carlo em algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1):265–285, 1999.
 [15] Brian S Caffo, Wolfgang Jank, and Galin L Jones. Ascentbased monte carlo expectation–maximization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):235–251, 2005.
 [16] KS Chan and Johannes Ledolter. Monte carlo em estimation for time series models involving counts. Journal of the American Statistical Association, 90(429):242–252, 1995.
 [17] Gersende Fort, Eric Moulines, et al. Convergence of the monte carlo expectation maximization for curved exponential families. The Annals of Statistics, 31(4):1220–1259, 2003.
 [18] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging $\{$AI$\}$ applications. In 13th $\mathrm{\{}$USENIX$\mathrm{\}}$ Symposium on Operating Systems Design and Implementation ($\mathrm{\{}$OSDI$\mathrm{\}}$ 18), pages 561–577, 2018.