Option Compatible Reward Inverse Reinforcement Learning
Abstract
Reinforcement learning with complex tasks is a challenging problem. Often, expert demonstrations of complex multitasking operations are required to train agents. However, it is difficult to design a reward function for given complex tasks. In this paper, we solve a hierarchical inverse reinforcement learning (IRL) problem within the framework of options. A gradient method for parametrized options is used to deduce a defining equation for the Q-feature space, which leads to a reward feature space. Using a second-order optimality condition for option parameters, an optimal reward function is selected. Experimental results in both discrete and continuous domains confirm that our segmented rewards provide a solution to the IRL problem for multitasking operations and show good performance and robustness against the noise created by expert demonstrations.
1 Introduction
Reinforcement learning (RL) seeks an optimal policy for a given reward function in a Markov decision process (MDP). There are several circumstances in which an agent can learn only from an expert demonstration, because it is difficult to prescribe a proper reward function for a given task. Inverse reinforcement learning (IRL) aims to find a reward function for which the expert's policy is optimal. When the IRL method is applied to a large domain, each trajectory of the required expert demonstration can be huge. There are also certain complex tasks that must be segmented into a sequence of subtasks (e.g., robotics for ubiquitous general-purpose automation ([11], [13]), robotic surgical procedure training ([7], [14]), hierarchical human behavior modeling [21], and autonomous driving [16]). For such complex tasks, a problem designer can decompose the task hierarchically, and an expert can then easily demonstrate it at different levels of implementation.
Another challenge with the IRL method is the design of feature spaces that capture the structure of the reward functions. Linear models for reward functions have been used in existing IRL algorithms. However, nonlinear models have recently been introduced [15], [6], [17]. Exploring more general feature spaces for reward functions becomes necessary when expert intuition is insufficient for designing good features, including linear models. This problem raises concerns, such as in the robotics field [20].
Regarding the first aspect of our problem, several works have considered the decomposition of underlying reward functions for given expert demonstrations in RL and IRL problems ([9], [3], [12]). For hierarchical IRL problems, most works focus on how to segment demonstrations of complex tasks and find suitable reward functions. For the IRL problem in the options framework, option discovery should first be carried out as a segmentation process. Since our work focuses on hierarchical extensions of policy-gradient-based IRL algorithms, we assign options for each given domain instead of applying option discovery algorithms.
To simultaneously resolve the problems of segmentation and feature construction, we propose a new method that applies the options framework presented by [2] to compatible reward inverse reinforcement learning (CRIRL) [17], a recent work on generating a feature space of rewards. Our method is called Option CRIRL. Previous works on the selection of proper reward functions for the IRL problem require designing features that consider the environment of the problem. In contrast, the CRIRL algorithm directly provides a space of features from which compatible reward functions can be constructed.
The main contribution of our work comprises the following items.

•
A new method of assigning task-wise reward functions for a hierarchical IRL problem is introduced. Although both OptionGAN [9] and our work use policy gradient methods as a common grounding component, the former adopts a GAN approach to solve the IRL problem, whereas we construct an explicit defining equation for reward features.

•
For a multitask IRL problem, we simultaneously obtain feature spaces for subtasks. When task-wise reward functions are constructed by combining the obtained features, they share the same combination parameters. As a consequence, we can avoid the problem of task-wise feature design and perform a one-shot process of optimal parameter selection across the domain of the entire task.

•
Our options-framework approach to solving the IRL problem in a multitask environment outperforms other algorithms that do not consider the multitask nature of the problem. While handling the termination of each subtask, introducing parameters to the termination and subtask policy functions in the policy gradient framework allows us to fine-tune subtask reward functions, providing a better reward selection. It also shows better robustness to noise in the expert policy than algorithms that do not use a hierarchical learning framework. The noise robustness of our algorithm is enabled by a feature space of larger dimension than that available to non-option CRIRL algorithms.
Our work differs in several aspects from recent works on the segmentation of reward functions in IRL problems ([9], [18], and [12]). [18] uses Bayesian nonparametric mixture models to simultaneously partition the demonstration and learn associated reward functions. It has an advantage in domains in which the subgoal of each subtask is definite. For such domains, a successful segmentation simply defines task-wise reward functions. However, our work allows for indefinite subgoals, for which an assignment of rewards is not simple. [12] focuses on segmentation using transitions defined as changes in local linearity with respect to a kernel function. It assumes predesigned features for reward functions. In contrast, our method does not assume any prior knowledge of feature spaces.
2 Preliminaries
Markov decision process. The Markov decision process comprises the state space, $S$, the action space, $A$, the transition function, $P:S\times A\to (S\to [0,1])$, and the reward function, $R:S\times A\to \mathbb{R}$. A policy is a probability distribution, $\pi :S\times A\to [0,1]$, over actions conditioned on the states. The value of a policy is defined as $V_{\pi}(s)=\mathbb{E}_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}R_{t+1}\mid s_{0}=s]$, and the action-value function is $Q_{\pi}(s,a)=\mathbb{E}_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}R_{t+1}\mid s_{0}=s,a_{0}=a]$, where $\gamma \in [0,1)$ is the discount factor.
Policy gradients. Policy gradient methods [22] aim to optimize a parametrized policy, $\pi_{\theta}$, via stochastic gradient ascent. In a discounted setting, the optimization of the expected $\gamma$-discounted return, $\rho (\theta ,s_{0})=\mathbb{E}_{\pi_{\theta}}[\sum_{t=0}^{\infty}\gamma^{t}R_{t+1}\mid s_{0}]$, is considered. It can be written as
$$\rho (\theta ,s_{0})=\int_{S}\mu_{\pi_{\theta}}(s\mid s_{0})\int_{A}\pi_{\theta}(a\mid s)R(s,a)\,da\,ds$$
where $\mu_{\pi_{\theta}}(s\mid s_{0})=\sum_{t=0}^{\infty}\gamma^{t}P(s_{t}=s\mid s_{0},\pi_{\theta})$. The policy gradient theorem [22] states:
$$\nabla_{\theta}\rho (\theta ,s_{0})=\int_{S}\int_{A}\mu_{\pi_{\theta}}(s\mid s_{0})\nabla_{\theta}\pi_{\theta}(a\mid s)Q_{\pi_{\theta}}(s,a)\,da\,ds.$$
CRIRL. The CRIRL method is an algorithm that generates a set of basis functions spanning the subspace of reward functions that cause the policy gradient to vanish. As input, it takes a parametric policy space, $\Pi_{\Theta}=\{\pi_{\theta}:\theta \in \Theta \subset \mathbb{R}^{k}\}$, and a set of trajectories from the expert policy, $\pi^{E}$. It first builds the features, $\{\varphi_{i}\}$, of the action-value function, which cause the policy gradient to vanish. These features can be transformed into reward features, $\{\psi_{i}\}$, via the Bellman equation (model-based case) or reward shaping [19] (model-free case). Then, a reward function that maximizes the expected return is chosen by enforcing a second-order optimality condition based on the policy Hessian [10], [8].
The options framework. We use the options framework for hierarchical learning. See [23] for details. A Markovian option, $\omega \in \Omega$, is a triple $(I_{\omega},\pi_{\omega},\beta_{\omega})$, where $I_{\omega}$ is an initiation set, $\pi_{\omega}$ is an intra-option policy, and $\beta_{\omega}:S\to [0,1]$ is a termination function. Following [2], we consider the call-and-return option execution model in which a fixed trajectory of options is given. Let $\pi_{\omega ,\theta}$ denote the intra-option policy of option $\omega$ parametrized by $\theta$, and $\beta_{\omega ,\vartheta}$ the termination function of the same option parametrized by $\vartheta$.
Option-critic architecture. [2] proposed a method of option discovery based on gradient ascent on the expected discounted return, defined by
$$\rho (\Omega ,\theta ,\vartheta ,s_{0},\omega_{0})=\mathbb{E}_{\Omega ,\theta ,\vartheta}\Big[\sum_{t=0}^{\infty}\gamma^{t}R_{t+1}\,\Big|\,s_{0},\omega_{0}\Big].$$
The objective function used here depends on the policy over options and on the parameters of the intra-option policies and termination functions. Its gradient with respect to these parameters is taken through the following equations: the option-value function can be written as
$$Q_{\Omega}(s,\omega )=\sum_{a}\pi_{\omega ,\theta}(a\mid s)Q_{U}(s,\omega ,a)$$
where
$$Q_{U}(s,\omega ,a)=R(s,a)+\gamma \sum_{s^{\prime}}P(s^{\prime}\mid s,a)U(\omega ,s^{\prime})$$
is the action-value function for the state-option pair, and
$$U(\omega ,s^{\prime})=(1-\beta_{\omega ,\vartheta}(s^{\prime}))Q_{\Omega}(s^{\prime},\omega )+\beta_{\omega ,\vartheta}(s^{\prime})V_{\Omega}(s^{\prime})$$
is the option-value function upon arrival.
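The three coupled equations above can be iterated to a fixed point on a small tabular MDP. The following is a minimal sketch under stated assumptions: the nested-list layout, the function name `evaluate_options`, and the use of plain fixed-point iteration are illustrative choices, not part of the option-critic method itself, and $V_{\Omega}(s)=\sum_{\omega}\pi_{\Omega}(\omega \mid s)Q_{\Omega}(s,\omega )$ is assumed.

```python
def evaluate_options(P, R, pi, beta, pi_Omega, gamma=0.9, iters=500):
    """Iterate the Q_Omega / Q_U / U equations on a tabular MDP.

    Hypothetical data layout (all nested lists):
      P[s][a][s2]     transition probability
      R[s][a]         reward
      pi[w][s][a]     intra-option policy of option w
      beta[w][s]      termination probability of option w
      pi_Omega[s][w]  policy over options
    """
    n_s, n_a, n_w = len(P), len(P[0]), len(pi)
    Q_Om = [[0.0] * n_w for _ in range(n_s)]
    for _ in range(iters):
        # V_Omega(s) = sum_w pi_Omega(w|s) Q_Omega(s, w)
        V = [sum(pi_Omega[s][w] * Q_Om[s][w] for w in range(n_w))
             for s in range(n_s)]
        # U(w, s) = (1 - beta) Q_Omega(s, w) + beta V_Omega(s)
        U = [[(1 - beta[w][s]) * Q_Om[s][w] + beta[w][s] * V[s]
              for s in range(n_s)] for w in range(n_w)]
        # Q_U(s, w, a) = R(s, a) + gamma sum_s2 P(s2|s, a) U(w, s2)
        Q_U = [[[R[s][a] + gamma * sum(P[s][a][s2] * U[w][s2]
                                       for s2 in range(n_s))
                 for a in range(n_a)] for w in range(n_w)]
               for s in range(n_s)]
        # Q_Omega(s, w) = sum_a pi_w(a|s) Q_U(s, w, a)
        Q_Om = [[sum(pi[w][s][a] * Q_U[s][w][a] for a in range(n_a))
                 for w in range(n_w)] for s in range(n_s)]
    return Q_Om, Q_U, U
```

With one state, one action, one option, and a constant reward of 1, the iteration converges to $1/(1-\gamma )$ regardless of the termination probability, which is a quick sanity check on the equations.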
3 Generation of Q-features compatible with the optimal policy
The first step in obtaining a reward function as a solution to a given IRL problem is to generate Q-features compatible with the expert policy using the gradient of the expected discounted return. We assume that the parametrized expert intra-option policies, $\pi_{\omega ,\theta}$, are differentiable with respect to $\theta$. By the intra-option policy gradient theorem [2], the gradient of the expected discounted return with respect to $\theta$ vanishes as in the following equation:
$$\nabla_{\theta}\rho =\sum_{s,\omega}\mu_{\Omega}(s,\omega \mid s_{0},\omega_{0})\sum_{a}\nabla_{\theta}\pi_{\omega ,\theta}(a\mid s)Q_{U}(s,\omega ,a)=0$$  (1)
where $\mu_{\Omega}(s,\omega \mid s_{0},\omega_{0})$ is the occupancy measure of state-option pairs.
The first-order optimality condition, $\nabla_{\theta}\rho =0$, gives a defining equation for Q-features compatible with the optimal policy. It is convenient to define a subspace of such compatible Q-features in the Hilbert space of functions on $\Omega \times S\times A$. We define the inner product:
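$$\langle f,g\rangle =\sum_{\omega ,s,a}f(\omega ,s,a)\,\mu_{\Omega}(s,\omega \mid s_{0},\omega_{0})\,\pi_{\omega ,\theta}(a\mid s)\,g(\omega ,s,a).$$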
Consider the subspace, $G_{\pi}=\{\nabla_{\theta}\log \pi_{\omega ,\theta}\,\alpha :\alpha \in \mathbb{R}^{k}\}$, of the Hilbert space of functions on $\Omega \times S\times A$ with the inner product defined above. Then, the space of Q-features can be represented by the orthogonal complement, $G_{\pi}^{\perp}$, of $G_{\pi}$.
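On a finite set of visited (option, state, action) triples, the orthogonal complement can be computed as a matrix null space. The sketch below assumes NumPy, a score matrix with one row per visited triple, and a weight vector $d=\mu_{\Omega}\cdot \pi_{\omega ,\theta}$ on those triples; the function name `q_feature_basis` and the SVD-based null-space routine are illustrative choices.

```python
import numpy as np

def q_feature_basis(grad_log_pi, d, tol=1e-10):
    """Orthonormal basis of {q : grad_log_pi.T @ diag(d) @ q = 0}.

    grad_log_pi : (n, k) array, row j is the score of the j-th visited triple
    d           : (n,) array of weights mu_Omega * pi on the visited triples
    """
    # Scaling column j of grad_log_pi.T by d[j] equals grad_log_pi.T @ diag(d).
    M = grad_log_pi.T * d
    _, sing, Vt = np.linalg.svd(M, full_matrices=True)
    rank = int(np.sum(sing > tol))
    # Rows of Vt beyond the rank span the null space of M.
    return Vt[rank:].T

# Toy check: one state, one option, softmax policy over three actions with
# logits as parameters, so the score of action a is e_a - pi.
pi = np.array([0.2, 0.3, 0.5])
scores = np.eye(3) - pi
basis = q_feature_basis(scores, pi)
```

In this toy case the only compatible Q-features are constant in the action, which matches the intuition that a softmax policy with free logits is optimal only for action-independent values.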
Parametrizing the terminations is expected to allow more finely tuned option-wise reward functions in IRL problems. We can impose an additional optimality condition on the expected discounted return with respect to the parameters of the termination function. Let
$$\widehat{\rho}(\Omega ,\theta ,\vartheta ,s_{1},\omega_{0})=\mathbb{E}_{\Omega ,\theta ,\vartheta}\Big[\sum_{t=0}^{\infty}\gamma^{t}R_{t+1}\,\Big|\,\omega_{0},s_{1}\Big]$$
be the expected discounted return with initial condition $({s}_{1},{\omega}_{0}).$ By the termination gradient theorem [2], one has
$$\nabla_{\vartheta}\widehat{\rho}=-\sum_{s^{\prime},\omega}\mu_{\Omega}(s^{\prime},\omega \mid s_{1},\omega_{0})\nabla_{\vartheta}\beta_{\omega ,\vartheta}(s^{\prime})A_{\Omega}(s^{\prime},\omega )$$  (2)
where $A_{\Omega}$ is the advantage function over options, $A_{\Omega}(s^{\prime},\omega )=Q_{\Omega}(s^{\prime},\omega )-V_{\Omega}(s^{\prime})$.
Requiring the gradient in (2) to vanish gives a constraint on the space of $Q$-features, $G_{\pi}^{\perp}$. For simplicity, set $\mu_{1,\Omega}(s^{\prime},\omega )=\mu_{\Omega}(s^{\prime},\omega \mid s_{1},\omega_{0})$. The constraint equation for $G_{\pi}^{\perp}$ is given by
$$\sum_{\omega ,s^{\prime}}\nabla_{\vartheta}\beta_{\omega ,\vartheta}(s^{\prime})\mu_{1,\Omega}(s^{\prime},\omega )\Big(Q_{\Omega}(s^{\prime},\omega )-\sum_{\omega^{\prime}}\pi_{\Omega}(\omega^{\prime}\mid s^{\prime})Q_{\Omega}(s^{\prime},\omega^{\prime})\Big)=0$$  (3)
where
$$Q_{\Omega}(s,\omega )=\sum_{a}\pi_{\omega ,\theta}(a\mid s)Q_{U}(s,\omega ,a).$$
Thus, we can combine the two linear equations (1) and (3) for $Q_{U}$ to define the space of $Q$-features.
Let $k$ be the number of parameters $\theta$ and let $p$ be the number of parameters $\vartheta$. Because $\sum_{a}\pi_{\omega ,\theta}(a\mid s)=1$ for each pair $(s,\omega )$, the rank of equation (1) is not greater than $\min \{k,|S||A||\Omega |-|S||\Omega |\}$, and the rank of equation (3) is not greater than $\min \{p,|S||A||\Omega |-|S||\Omega |\}$. Thus, the dimension of the space of Q-features defined by (1) and (3) is not less than $\max \{|S||\Omega |,|S||A||\Omega |-(k+p)\}$.
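The bound is easy to evaluate for concrete sizes. The helper below is a small sketch; the function name and the example sizes (500 states, 6 actions, 4 options, 100 policy parameters, 20 termination parameters) are hypothetical and chosen only to illustrate the arithmetic.

```python
def q_feature_dim_lower_bound(n_states, n_actions, n_options, k, p):
    """Lower bound max{|S||Omega|, |S||A||Omega| - (k + p)} on the
    dimension of the Q-feature space defined by equations (1) and (3)."""
    return max(n_states * n_options,
               n_states * n_actions * n_options - (k + p))

# Hypothetical sizes: 500 * 6 * 4 - (100 + 20) = 11880 > 500 * 4 = 2000.
bound = q_feature_dim_lower_bound(500, 6, 4, k=100, p=20)
```

When the parameter count $k+p$ is large relative to $|S||A||\Omega |$, the first term $|S||\Omega |$ takes over, so the space never collapses below one feature per state-option pair.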
4 Reward function from Q-functions
For the MDP model, if two reward functions share the same optimal policy, then they satisfy the following ([19]):
$$R^{\prime}(s,a)=R(s,a)+\gamma \sum_{s^{\prime}}P(s^{\prime}\mid s,a)\chi (s^{\prime})-\chi (s).$$
If we take $\chi =V$, then
$$R^{\prime}(s,a)=Q(s,a)-V(s)=Q(s,a)-\sum_{a^{\prime}}\pi (a^{\prime}\mid s)Q(s,a^{\prime}).$$
Because the $Q$-value function depends on the option in the options framework, the potential function, $\chi$, also depends on the option. We thus need to consider reward shaping with regard to the intra-option policy, $\pi_{\omega}$. The reward functions then also need to be defined in the intra-option sense, so they also depend on options. This viewpoint is essential to our work and is similar to the approach taken in [9], in which the reward option, $R_{\omega}$, was introduced corresponding to the intra-option policy, $\pi_{\omega}$. Reward functions, $R_{\omega},R_{\omega}^{\prime}$, sharing the same intra-option policy, $\pi_{\omega}$, satisfy
$$R_{\omega}^{\prime}(s,a)=R_{\omega}(s,a)+\gamma \sum_{s^{\prime}}P(s^{\prime}\mid s,a)\chi (s^{\prime},\omega )-\chi (s,\omega ).$$
If we take $\chi (s,\omega )=U(\omega ,s)$, then
$$\begin{aligned}R_{\omega}^{\prime}(s,a)&=R_{\omega}(s,a)+\gamma \sum_{s^{\prime}}P(s^{\prime}\mid s,a)U(\omega ,s^{\prime})-U(\omega ,s)\\ &=Q_{U}(s,\omega ,a)-[(1-\beta (s))Q_{\Omega}(s,\omega )+\beta (s)V_{\Omega}(s)]\\ &=Q_{U}(s,\omega ,a)-\sum_{a^{\prime}}\pi_{\omega}(a^{\prime}\mid s)Q_{U}(s,\omega ,a^{\prime})+\beta (s)A_{\Omega}(s,\omega )\end{aligned}$$
This provides us with a way to generate reward functions from $Q$-features in the options framework.
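The last line of the derivation translates directly into code. The sketch below evaluates the shaped reward at a single state from tabular $Q_{U}$ values; the function name and list layout are assumptions made for illustration, and $V_{\Omega}$ is taken to be the $\pi_{\Omega}$-weighted average of $Q_{\Omega}$.

```python
def shaped_reward_at_state(q_u, pi, beta, pi_Omega, w):
    """Shaped reward R'_w(s, a) for every action a at one fixed state s.

    q_u[o][a]    Q_U(s, o, a) for each option o and action a
    pi[o][a]     intra-option policy pi_o(a|s)
    beta[o]      termination probability beta_o(s)
    pi_Omega[o]  policy over options pi_Omega(o|s)
    w            index of the option whose reward is computed
    """
    # Q_Omega(s, o) = sum_a pi_o(a|s) Q_U(s, o, a)
    q_om = [sum(p * q for p, q in zip(pi[o], q_u[o])) for o in range(len(q_u))]
    # V_Omega(s) = sum_o pi_Omega(o|s) Q_Omega(s, o)
    v_om = sum(pi_Omega[o] * q_om[o] for o in range(len(q_om)))
    adv = q_om[w] - v_om
    # R'_w(s, a) = Q_U(s, w, a) - Q_Omega(s, w) + beta_w(s) A_Omega(s, w)
    return [q - q_om[w] + beta[w] * adv for q in q_u[w]]

q_u = [[1.0, 3.0], [0.0, 2.0]]
pi = [[0.5, 0.5], [0.25, 0.75]]
rewards = shaped_reward_at_state(q_u, pi, beta=[0.2, 1.0],
                                 pi_Omega=[0.5, 0.5], w=0)
```

A useful consistency check is that the expectation of the shaped reward under $\pi_{\omega}$ equals $\beta (s)A_{\Omega}(s,\omega )$, because the first two terms cancel in expectation.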
5 Reward selection via the second-order optimality condition
Among the linear combinations of reward features constructed in the previous section, we must select one that maximizes $\rho (\theta )$ and $\widehat{\rho}(\vartheta )$. For this optimization, we use the second-order optimality condition based on the Hessians of $\rho (\theta )$ and $\widehat{\rho}(\vartheta )$.
Consider a trajectory, $\tau =((s_{0},\omega_{0},a_{0},b_{0}),\cdots ,(s_{T-1},\omega_{T-1},a_{T-1},b_{T-1}))$, with termination indicator $b_{t}$ and terminal state $s_{T}$. The termination indicator, $b_{t}$, is 1 if the previous option terminates at step $t$, and 0 otherwise. The probability density of trajectory $\tau$ is given by
$$\mathbb{P}_{\theta ,\vartheta}(\tau )=p_{0}(s_{0})\delta_{b_{0}=1}\pi_{\Omega}(\omega_{0}\mid s_{0})\prod_{t=1}^{T-1}\mathbb{P}(b_{t},\omega_{t}\mid \omega_{t-1},s_{t})\prod_{t=0}^{T-1}\pi_{\omega_{t}}(a_{t}\mid s_{t})p(s_{t+1}\mid s_{t},a_{t}),$$
where
$$\begin{aligned}\mathbb{P}(b_{t}=1,\omega_{t}\mid \omega_{t-1},s_{t})&=\beta_{\omega_{t-1}}(s_{t})\pi_{\Omega}(\omega_{t}\mid s_{t}),\\ \mathbb{P}(b_{t}=0,\omega_{t}\mid \omega_{t-1},s_{t})&=(1-\beta_{\omega_{t-1}}(s_{t}))\delta_{\omega_{t}=\omega_{t-1}}.\end{aligned}$$
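The two-case option transition kernel can be written as a small function, which also makes it easy to confirm that it is a proper distribution over $(b_{t},\omega_{t})$. This is an illustrative sketch; the function name and argument layout are assumptions.

```python
def option_step_prob(b_t, w_t, w_prev, beta_prev_s, pi_Omega_s):
    """P(b_t, w_t | w_prev, s_t) under call-and-return execution.

    beta_prev_s : termination probability of the previous option at s_t
    pi_Omega_s  : list of pi_Omega(w | s_t) over all options w
    """
    if b_t == 1:
        # Previous option terminates, then a new option is drawn from pi_Omega.
        return beta_prev_s * pi_Omega_s[w_t]
    # Previous option continues, so the option index cannot change.
    return (1.0 - beta_prev_s) * (1.0 if w_t == w_prev else 0.0)

# Summing over both indicator values and all options gives 1.
total = sum(option_step_prob(b, w, 1, 0.3, [0.6, 0.4])
            for b in (0, 1) for w in (0, 1))
```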
We denote the set of all possible trajectories by $\mathbb{T}$ and the $\gamma$-discounted trajectory reward by $R(\tau )=\sum_{t=0}^{T(\tau )}\gamma^{t}R(s_{\tau ,t},a_{\tau ,t})$. Then, the objective function can be rewritten as
$$\rho (\Omega ,\theta ,\vartheta ,s_{0},\omega_{0})=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}R_{t+1}\,\Big|\,s_{0},\omega_{0}\Big]=\int_{\mathbb{T}}\mathbb{P}_{\theta ,\vartheta}(\tau )R(\tau )\,d\tau .$$
Its gradient and Hessian with respect to $\theta $ can be expressed as
$${\nabla}_{\theta}\rho ={\int}_{\mathbb{T}}{\mathbb{P}}_{\theta ,\vartheta}(\tau ){\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta ,\vartheta}(\tau )R(\tau )\mathit{d}\tau $$ 
and
$${\mathscr{H}}_{\theta}\rho ={\int}_{\mathbb{T}}{\mathbb{P}}_{\theta ,\vartheta}(\tau )({\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta ,\vartheta}(\tau ){\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta ,\vartheta}{(\tau )}^{T}+{\mathscr{H}}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta ,\vartheta}(\tau ))R(\tau )\mathit{d}\tau .$$ 
The second objective function can be written as
$$\widehat{\rho}=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}R_{t+1}\,\Big|\,\omega_{0},s_{1}\Big]=\int_{\widehat{\mathbb{T}}}\widehat{\mathbb{P}}_{\theta ,\vartheta}(\widehat{\tau})R(\widehat{\tau})\,d\widehat{\tau},$$
where $\widehat{\tau}$ is a trajectory beginning with $(\omega_{0},s_{1})$ with the probability distribution
$$\widehat{\mathbb{P}}_{\theta ,\vartheta}(\widehat{\tau})=p_{0}(\omega_{0},s_{1})\prod_{t=1}^{T-1}\mathbb{P}(b_{t},\omega_{t}\mid \omega_{t-1},s_{t})\prod_{t=1}^{T-1}\pi_{\omega_{t}}(a_{t}\mid s_{t})p(s_{t+1}\mid s_{t},a_{t}).$$
Then, its Hessian can be written as
$${\mathscr{H}}_{\vartheta}\widehat{\rho}={\int}_{\widehat{\mathbb{T}}}{\widehat{\mathbb{P}}}_{\theta ,\vartheta}(\widehat{\tau})({\nabla}_{\vartheta}\mathrm{log}{\widehat{\mathbb{P}}}_{\theta ,\vartheta}(\widehat{\tau}){\nabla}_{\vartheta}\mathrm{log}{\widehat{\mathbb{P}}}_{\theta ,\vartheta}{(\widehat{\tau})}^{T}+{\mathscr{H}}_{\vartheta}\mathrm{log}{\widehat{\mathbb{P}}}_{\theta ,\vartheta}(\widehat{\tau}))R(\widehat{\tau})\mathit{d}\widehat{\tau}.$$ 
Let $\{{\psi}_{\omega ,i}\}$ be the reward features constructed from the previous section. Rewrite each Hessian as ${\mathscr{H}}_{\theta}(\rho )={\sum}_{i}{w}_{i}{\mathscr{H}}_{\theta}{\rho}_{i},$ where ${\rho}_{i}$ is the expected return with respect to ${\mathbb{P}}_{\theta ,\vartheta}$ for the reward function, ${\psi}_{i}$, and as ${\mathscr{H}}_{\vartheta}(\widehat{\rho})={\sum}_{i}{w}_{i}{\mathscr{H}}_{\vartheta}{\widehat{\rho}}_{i},$ where ${\widehat{\rho}}_{i}$ is the expected return with respect to ${\widehat{\mathbb{P}}}_{\theta ,\vartheta}$ for the reward function, ${\psi}_{i}$. Set $t{r}_{\theta ,i}=tr({\mathscr{H}}_{\theta}({\rho}_{i}))$ and $t{r}_{\vartheta ,i}=tr({\mathscr{H}}_{\vartheta}({\widehat{\rho}}_{i}))$ for $i=1,\mathrm{\cdots},p$.
We want to determine the reward weight, $\mathbf{w}$, for the reward function, ${R}_{\omega}={\sum}_{i=1}^{p}{w}_{i}{\psi}_{\omega ,i}$, which yields a negative definite Hessian with a minimal trace. Geometrically, this corresponds to choosing a reward function that makes the expected returns locally the sharpest maximum points. Here, to relieve a computational burden, we exploit a heuristic method suggested by [17].
Thus, we only choose reward features having negative definite Hessians, compute the trace of each Hessian, and collect them in the vectors ${\mathrm{\mathbf{t}\mathbf{r}}}_{\theta}=(t{r}_{\theta ,i})$ and ${\mathrm{\mathbf{t}\mathbf{r}}}_{\vartheta}=(t{r}_{\vartheta ,i})$. We determine w by solving
$$\min_{\mathbf{w}}\mathbf{w}^{T}\mathbf{tr}_{\theta}\quad \text{and}\quad \min_{\mathbf{w}}\mathbf{w}^{T}\mathbf{tr}_{\vartheta}\qquad \text{s.t.}\qquad \|\mathbf{w}\|_{2}^{2}=1.$$
Typically, multi-objective optimization problems have no single solution that optimizes all objective functions simultaneously. One well-known approach to this problem is linear scalarization. Thus, we consider the following single-objective problem:
$$\min_{\mathbf{w}}\ \lambda_{\theta}\mathbf{w}^{T}\mathbf{tr}_{\theta}+\lambda_{\vartheta}\mathbf{w}^{T}\mathbf{tr}_{\vartheta}\qquad \text{s.t.}\qquad \|\mathbf{w}\|_{2}^{2}=1$$
with positive weights $\lambda_{\theta}$ and $\lambda_{\vartheta}$. A closed-form solution is $\mathbf{w}=-(\lambda_{\theta}\mathbf{tr}_{\theta}+\lambda_{\vartheta}\mathbf{tr}_{\vartheta})/\|\lambda_{\theta}\mathbf{tr}_{\theta}+\lambda_{\vartheta}\mathbf{tr}_{\vartheta}\|$. Different choices of the scalarization weights, $\lambda_{\theta}$ and $\lambda_{\vartheta}$, produce different reward functions. In practice, it is difficult to choose a proper weight pair, because the gap between the magnitudes of the two trace vectors can be too large. Alternatively, we consider a weighted sum of the solutions to the two single-objective optimization problems: $\mathbf{w}=-\alpha_{\theta}(\mathbf{tr}_{\theta}/\|\mathbf{tr}_{\theta}\|)-\alpha_{\vartheta}(\mathbf{tr}_{\vartheta}/\|\mathbf{tr}_{\vartheta}\|)$, with $\alpha_{\theta},\alpha_{\vartheta}>0$ and $\alpha_{\theta}^{2}+\alpha_{\vartheta}^{2}=1$. With this choice, the obtained solution is not guaranteed to be Pareto optimal.
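Both weight rules can be sketched in a few lines. By the Cauchy-Schwarz inequality, minimizing $\mathbf{w}^{T}\mathbf{v}$ over the unit sphere is attained at $\mathbf{w}=-\mathbf{v}/\|\mathbf{v}\|$, which is what the code below uses for the scalarized problem with $\mathbf{v}=\lambda_{\theta}\mathbf{tr}_{\theta}+\lambda_{\vartheta}\mathbf{tr}_{\vartheta}$. The function names and the example trace vectors are hypothetical.

```python
import math

def unit(v):
    """Return v scaled to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def scalarized_weights(tr_theta, tr_vartheta, lam_theta=1.0, lam_vartheta=1.0):
    """Minimizer of lam_theta * w.tr_theta + lam_vartheta * w.tr_vartheta
    subject to ||w|| = 1, i.e. the negative normalized combined trace vector."""
    v = [lam_theta * a + lam_vartheta * b for a, b in zip(tr_theta, tr_vartheta)]
    return [-x for x in unit(v)]

def blended_weights(tr_theta, tr_vartheta,
                    alpha_theta=1 / math.sqrt(2), alpha_vartheta=1 / math.sqrt(2)):
    """Weighted sum of the two single-objective minimizers."""
    u, v = unit(tr_theta), unit(tr_vartheta)
    return [-(alpha_theta * a + alpha_vartheta * b) for a, b in zip(u, v)]

tr_theta = [-3.0, -4.0]      # hypothetical Hessian traces
tr_vartheta = [0.0, -5.0]
w_scal = scalarized_weights(tr_theta, tr_vartheta)
w_blend = blended_weights(tr_theta, tr_vartheta)
```

Normalizing each trace vector before blending removes the scale gap between the two objectives, at the cost that the blended vector is in general neither unit-norm nor a minimizer of any single scalarized problem.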
6 Algorithm
We summarize our algorithm for solving the IRL problem in the options framework as follows:
Input: (1) Expert's demonstration trajectories $D=\bigcup_{i=1}^{N}\{(\omega_{\tau_{i},0},s_{\tau_{i},0},a_{\tau_{i},0}),(\omega_{\tau_{i},1},s_{\tau_{i},1},a_{\tau_{i},1}),\cdots ,(\omega_{\tau_{i},T(\tau_{i})},s_{\tau_{i},T(\tau_{i})},a_{\tau_{i},T(\tau_{i})})\}$; (2) options $\Omega_{\theta ,\vartheta}=\{\omega \}$, for which the expert's policies, $\pi_{\omega ,\theta}$, and termination functions, $\beta_{\omega ,\vartheta}$, are parametrized, and a policy $\pi_{\Omega}$ over options.
Output: Reward function $R_{\omega}(s,a)$.
Phase 1
1. Estimate $d_{\mu}^{\pi_{\theta}}(\omega ,s,a)=\mu_{\Omega}(s,\omega \mid s_{0},\omega_{0})\,\pi_{\omega ,\theta}(a\mid s)$ for the visited state-action pairs.
2. Get the $|D|\times |D|$ matrix $D_{\mu}^{\pi_{\theta}}=\text{diag}(d_{\mu}^{\pi_{\theta}})$ and the $|D|\times k$ matrix $\nabla_{\theta}\log \pi_{\omega}(a\mid s)$.
3. Compute the null space of $\nabla_{\theta}\log \pi_{\omega}^{T}D_{\mu}^{\pi_{\theta}}$.
4. Estimate $\mu_{1}=\mu_{\Omega}(s,\omega \mid s_{1},\omega_{0})$ for the visited state-option pairs.
5. Get the matrices $\text{diag}(\mu_{1})$, $\nabla_{\vartheta}\beta$, $\Pi_{\Omega}$, and $\Pi$.
6. Compute the null space of $\nabla_{\vartheta}\beta^{T}\,\text{diag}(\mu_{1})\,(I-\Pi_{\Omega})\,\Pi$.
7. Find the intersection, $\Phi$, of the two null spaces.
8. Get the set of advantage functions using $A=(I-\Pi_{\Omega})\,\Pi \,\Phi$.
Phase 2
9. Get the set of reward functions by applying reward shaping: $\Psi =(I-\Pi )\,\Phi +\beta A$.
10. Apply singular value decomposition to orthogonalize $\Psi$.
Phase 3
11. Estimate the Hessians, $\widehat{\mathscr{H}}_{\theta}\rho_{i}$ and $\widehat{\mathscr{H}}_{\vartheta}\widehat{\rho}_{i}$, for each reward feature, $\psi_{i}$, $i=1,\dots ,p$.
12. Discard the reward features having an indefinite Hessian; switch the sign of those having a positive semidefinite Hessian; and compute $tr_{\theta ,i}=tr(\widehat{\mathscr{H}}_{\theta}(\rho_{i}))$ and $tr_{\vartheta ,i}=tr(\widehat{\mathscr{H}}_{\vartheta}(\widehat{\rho}_{i}))$ for $i=1,\cdots ,p$.
13. Return the reward function $R_{\omega}=\Psi \mathbf{w}_{\omega}$, with $\mathbf{w}_{\omega}=-(1/\sqrt{2})\,(\mathbf{tr}_{\theta}/\|\mathbf{tr}_{\theta}\|+\mathbf{tr}_{\vartheta}/\|\mathbf{tr}_{\vartheta}\|)$.
Our algorithm consists of three phases. In the first phase, we obtain a basis for the Q-feature space by solving linear equations. These equations consist of two parts: the first is defined by the gradient of the log-policy, and the second by the gradient of the option termination function. The matrices $\Pi_{\Omega}$ and $\Pi$ are introduced to carry out the computation for the second part. The matrix $\Pi_{\Omega}$ repeats the policy over options, $\pi_{\Omega}$, row by row over the visited option-state pairs. The matrix $\Pi$ is block diagonal, where each block holds the intra-option policy over the visited state-action pairs for one option.
In the second phase, we obtain a basis for the reward features using reward shaping for options. In the last phase, we select the reward function by applying the Hessian test to the two objective functions.
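The Hessian test of the last phase can be sketched as an eigenvalue screen. The code below assumes NumPy and, for brevity, handles a single estimated Hessian per reward feature; how the tests on the two Hessians are combined for one feature is not spelled out here, so that part is left out. The function name `screen_features` and the tolerance are illustrative.

```python
import numpy as np

def screen_features(hessians, tol=1e-9):
    """Keep features whose Hessian is semidefinite and record their traces.

    Returns (kept, signs, traces): indices of kept features, the sign applied
    to each kept feature (-1 means the feature is flipped), and the trace of
    the Hessian after the sign flip.
    """
    kept, signs, traces = [], [], []
    for i, H in enumerate(hessians):
        eig = np.linalg.eigvalsh(H)      # H is assumed symmetric
        if np.all(eig <= tol):           # negative semidefinite: keep as is
            sign = 1.0
        elif np.all(eig >= -tol):        # positive semidefinite: flip sign
            sign = -1.0
        else:                            # indefinite: discard
            continue
        kept.append(i)
        signs.append(sign)
        traces.append(sign * float(np.trace(H)))
    return kept, signs, traces

hessians = [-np.eye(2), np.eye(2), np.diag([1.0, -1.0])]
kept, signs, traces = screen_features(hessians)
```

After the sign flip every retained trace is non-positive, so the negative normalized trace vector assigns non-negative weights to the retained features.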
Our algorithm can be naturally extended to continuous state and action spaces. In continuous domains, we use a $k$-nearest neighbors method to extend recovered reward functions to non-visited state-action pairs. Additional penalization terms can be included. Details about the implementation are presented in Section 7.2.
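A minimal sketch of the nearest-neighbor extension follows. It assumes Euclidean distance on concatenated state-action vectors and an unweighted average over the $k$ nearest visited pairs; the penalization terms mentioned above are not modeled, and the function name is hypothetical.

```python
import math

def knn_reward(query, visited, rewards, k=3):
    """Estimate the reward at an unvisited state-action pair as the mean
    reward of its k nearest visited pairs (Euclidean distance)."""
    order = sorted(range(len(visited)),
                   key=lambda i: math.dist(query, visited[i]))
    nearest = order[:k]
    return sum(rewards[i] for i in nearest) / len(nearest)

visited = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]   # hypothetical (s, a) points
rewards = [1.0, 3.0, 100.0]                         # recovered rewards there
estimate = knn_reward((0.4, 0.0), visited, rewards, k=2)
```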
7 Experiment
7.1 Taxi
We test Option CRIRL in the Taxi domain defined in [4]. The environment is a 5×5 grid world (Figure 1) having four landmarks, $L=\{1,2,3,4\}$. The state vector, $s=(x,y,p,d)$, comprises the current position, $(x,y)$, of the taxi, the location, $p\in L\cup \{taxi\}$, of the passenger, and the passenger's destination, $d\in L$. The possible actions are movements in four directions plus pick-up and drop-off actions. In each episode, the positions of the taxi, passenger, and destination are assigned to random landmarks. The task of the driver is to pick up the passenger and drop him off at the destination. Depending on where the passenger or destination is located, each subtask involves navigating to one of the four landmarks. We define the option set, $\Omega_{1}=\{(I_{\omega},\pi_{\omega},\beta_{\omega})\mid \omega \in L\}$, with one option per landmark subgoal, where
$$\left\{\begin{array}{l}I_{\omega}:\quad p=\omega \text{ or }(p=taxi\text{ and }d=\omega )\\ \pi_{\omega}:\quad \text{the policy for getting to the landmark }\omega \\ \beta_{\omega}:\quad 1\text{ if }(x,y)\text{ is the location of }\omega ;\ 0\text{ otherwise.}\end{array}\right.$$
The intra-option policy, $\pi_{\omega}$, can be induced from the expert policy computed by value iteration by duplicating the policy parameters relevant to its subgoal. The policy, $\pi_{\Omega}$, over options is deterministic, because a subgoal can be directly specified from the current state. This user-defined option set captures the hierarchical structure of the Taxi environment. We further evaluate our algorithm using the option set $\Omega_{2}$, discovered by [2].
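The user-defined options can be written down as plain predicates. In the sketch below, the landmark coordinates are hypothetical placeholders, and `option_for` encodes one reading of the deterministic policy over options (head to the passenger's landmark until pick-up, then to the destination); neither is taken from the experimental setup.

```python
# Hypothetical landmark coordinates on the 5x5 grid (illustration only).
LANDMARKS = {1: (0, 0), 2: (0, 4), 3: (4, 0), 4: (4, 3)}

def can_initiate(w, state):
    """Initiation set I_w: p = w, or (p = taxi and d = w)."""
    x, y, p, d = state
    return p == w or (p == "taxi" and d == w)

def terminates(w, state):
    """Termination function beta_w: 1 at the landmark of w, 0 otherwise."""
    x, y, p, d = state
    return 1.0 if (x, y) == LANDMARKS[w] else 0.0

def option_for(state):
    """Deterministic choice of subgoal from the current state (assumed form)."""
    x, y, p, d = state
    return d if p == "taxi" else p

state = (1, 1, 2, 3)      # taxi at (1, 1), passenger at landmark 2, destination 3
chosen = option_for(state)
```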
Each intra-option policy is parametrized as an $\epsilon$-Boltzmann policy:
$$\pi_{\theta ,\omega}(a\mid s)=(1-\epsilon )\frac{e^{\theta_{\omega ,a}^{T}\zeta_{s}}}{\sum_{a^{\prime}\in A}e^{\theta_{\omega ,a^{\prime}}^{T}\zeta_{s}}}+\frac{\epsilon}{|A|},$$
where the policy features, $\zeta_{s}$, are the state features defined by current location, passenger location, destination location, and whether the passenger has been picked up. The noise, $\epsilon$, is included to evaluate the robustness to imperfect experts. The termination probability, $\beta_{\vartheta}$, is parametrized as a sigmoid function.
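The $\epsilon$-Boltzmann parametrization is a softmax mixed with a uniform distribution, as the following sketch shows. The function name and the toy parameter values are assumptions; the maximum logit is subtracted before exponentiation purely for numerical stability.

```python
import math

def eps_boltzmann(theta_w, zeta_s, eps):
    """Action probabilities of an epsilon-Boltzmann intra-option policy.

    theta_w : list of parameter vectors theta_{w,a}, one per action
    zeta_s  : state feature vector
    eps     : mixing weight of the uniform noise
    """
    logits = [sum(t * z for t, z in zip(theta_a, zeta_s)) for theta_a in theta_w]
    m = max(logits)                      # stabilizes the exponentials
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    n = len(logits)
    return [(1 - eps) * e / total + eps / n for e in exps]

theta_w = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # hypothetical parameters
probs = eps_boltzmann(theta_w, [1.0, 0.0], eps=0.3)
```

Every action keeps probability at least $\epsilon /|A|$, which is what makes the demonstrated policy noisy but still differentiable in $\theta$.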
For comparison, we weight the option-wise reward functions, $R_{\omega}(s,a)$, by the policy over options:
$$R(s,a)=\sum_{\omega \in \Omega}\pi_{\Omega}(\omega \mid s)R_{\omega}(s,a).$$
Combining the rewards assigned to each option makes it easy to compare against other IRL algorithms, while the combined reward $R(s,a)$ retains the nature of each subtask. We evaluate Option CRIRL against behavior cloning (BC), maximum entropy IRL (MEIRL) [24], and linear programming apprenticeship learning (LPAL) [1]. A natural choice for a reward feature in MEIRL and LPAL is the policy feature, $\zeta_{s}$, defined above. Figures 2 and 3 show the results of training a Boltzmann policy using REINFORCE, coupled with the recovered reward function, for the user-defined option set, $\Omega_{1}$, and the discovered option set, $\Omega_{2}$, respectively. Each result is averaged over 10 repetitions, and the error bars correspond to the standard deviation. We see that Option CRIRL converges faster to the optimal policy than do the original reward function and MEIRL when the user-defined option set is used. When the discovered option set is used, MEIRL often fails to learn the optimal policy. BC and LPAL are very sensitive to noise, whereas our algorithm is not significantly affected by the noise level. The noise robustness of our algorithm can be explained by the larger dimension of the feature space compared with non-option CRIRL algorithms. As explained at the end of Section 3, the dimension of the space of Q-features increases by a factor of $|\Omega |$, the number of options, and can absorb the change in the defining equation of the Q-features.
7.2 Car on the Hill
We test Option CRIRL in the continuous Car-on-the-Hill domain [5]. A car traveling on a hill is required to reach the top of the hill. Here, the shape of the hill is given by the function, $Hill(p)$:
$$ 
The state space is continuous with dimension two: position $p$ and speed $v$ of the car with $p\in [-1,1]$ and $v\in [-3,3]$. The action $a\in [-4,4]$ acts on the car’s acceleration. The reward function, $R(p,v,a)$, is defined as:
$$ 
The discount factor, $\gamma $, is 0.95, and the initial state is ${p}_{0}=-0.5$ with ${v}_{0}=0$.
Because the car’s engine is not strong enough, simply accelerating up the slope does not reach the goal. The entire task can be divided into two subtasks: gaining enough speed at the bottom of the valley to leverage potential energy (subgoal 1), and driving to the top (subgoal 2). To evaluate our algorithm, we introduce hand-crafted options:
$$\begin{cases}{I}_{\omega}: & \text{the state space } S \\ {\pi}_{\omega}: & \text{the policy for subgoal } \omega \\ {\beta}_{\omega}: & 1 \text{ if the agent achieves the subgoal; } 0 \text{ otherwise}\end{cases}$$ 
for $\omega \in \{1,2\}$. Intra-option policy ${\pi}_{\omega}$ is defined by approximating the deterministic intra-option policies, ${\pi}_{\omega ,FQI}$, via fitted Q iteration (FQI) [5] on the two corresponding small MDPs. We consider noisy intra-option policies in which a random action is selected with probability $\epsilon$:
$${\pi}_{\omega}(a|s)=(1-\epsilon ){\pi}_{\omega ,FQI}(a|s)+\epsilon {\pi}_{random}(a|s)$$ 
for each option, $\omega $. Each intra-option policy is parametrized as a Gaussian policy ${\pi}_{\theta ,\omega}(a|s)\sim \mathcal{N}({y}_{\theta ,\omega}(s),{\sigma}^{2})$, where ${\sigma}^{2}$ is fixed to be 0.01, and ${y}_{\theta ,\omega}(s)$ is obtained using radial basis functions:
$${y}_{\theta ,\omega}(s)=\sum _{k=1}^{N}{\theta}_{\omega ,k}{e}^{-\delta {\parallel s-{s}_{k}\parallel}^{2}},$$ 
with centers, ${s}_{k}$, on a uniform grid in the state space. The parameter, ${\theta}_{\omega}$, is estimated using 20 expert trajectories for each option. Termination probability, ${\beta}_{\vartheta ,\omega}$, is parametrized as a sigmoid function.
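The noisy expert and the radial-basis Gaussian policy described above can be sketched as follows. This is a minimal illustration, not the experimental code: the function names, the $3\times 3$ grid, the value of $\delta$, and the choice of a uniform distribution over $[-4,4]$ for ${\pi}_{random}$ are all assumptions made for the example.

```python
import math
import random


def rbf_features(s, centers, delta):
    """Radial basis features exp(-delta * ||s - s_k||^2), one per grid center s_k."""
    return [
        math.exp(-delta * sum((x - c) ** 2 for x, c in zip(s, center)))
        for center in centers
    ]


def policy_mean(theta, s, centers, delta):
    """Mean y_theta(s) of the Gaussian intra-option policy for one option."""
    return sum(t * f for t, f in zip(theta, rbf_features(s, centers, delta)))


def sample_action(theta, s, centers, delta, sigma2=0.01, rng=random):
    """Draw an action from N(y_theta(s), sigma2)."""
    return rng.gauss(policy_mean(theta, s, centers, delta), math.sqrt(sigma2))


def noisy_expert_action(expert_action, epsilon, rng=random, low=-4.0, high=4.0):
    """Return a random action with probability epsilon, else the expert's action.

    Assumption: the random policy is uniform over the action range [low, high].
    """
    if rng.random() < epsilon:
        return rng.uniform(low, high)
    return expert_action


# Hypothetical 3 x 3 grid of centers over position [-1, 1] and speed [-3, 3].
centers = [(p, v) for p in (-1.0, 0.0, 1.0) for v in (-3.0, 0.0, 3.0)]
theta = [0.0] * len(centers)
theta[4] = 2.0  # only the center at (0, 0) contributes
action = sample_action(theta, (0.0, 0.0), centers, delta=1.0, rng=random.Random(0))
```

With the variance fixed, fitting $\theta$ by maximum likelihood reduces to least squares on the expert actions; the estimation procedure itself is not shown here.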
For comparison, the task-wise reward function, ${R}_{\omega}(s,a)$, is merged into one reward, $R(s,a)$, by omitting the option term. This modification is possible because the policy over options is deterministic in our setting. The merged reward function, $R(s,a)$, can be compared with other reward functions using a non-hierarchical RL algorithm. We extend the recovered reward function to non-visited state-action pairs using a kernel $k$-nearest neighbors (KNN) regression with a Gaussian kernel:
$${R}^{\text{nonpenalized}}(s,a)=\frac{{\sum}_{({s}^{\prime},{a}^{\prime})\in \text{KNN}((s,a),k,\mathcal{D})}\mathcal{K}((s,a),({s}^{\prime},{a}^{\prime}))R({s}^{\prime},{a}^{\prime})}{{\sum}_{({s}^{\prime},{a}^{\prime})\in \text{KNN}((s,a),k,\mathcal{D})}\mathcal{K}((s,a),({s}^{\prime},{a}^{\prime}))}$$ 
where $\text{KNN}((s,a),k,\mathcal{D})$ is the set of the $k$ nearest state-action pairs in the demonstrations, $\mathcal{D}$, and $\mathcal{K}$ is a Gaussian kernel over $S\times A$:
$$\mathcal{K}((s,a),({s}^{\prime},{a}^{\prime}))=\mathrm{exp}\left(-\frac{1}{2{\sigma}_{S}^{2}}{\parallel s-{s}^{\prime}\parallel}^{2}-\frac{1}{2{\sigma}_{A}^{2}}{\parallel a-{a}^{\prime}\parallel}^{2}\right).$$ 
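A minimal sketch of this kernel KNN extension is given below. It assumes states and actions are stored as tuples of floats, that the $k$ nearest pairs are selected by Euclidean distance on the concatenated state-action vector (the metric is an assumption of this example), and that an unweighted neighbor mean is used if all kernel weights underflow to zero. All names are illustrative.

```python
import math


def gaussian_kernel(sa, sa_prime, sigma_s, sigma_a):
    """Gaussian kernel over S x A with separate state and action bandwidths."""
    (s, a), (s2, a2) = sa, sa_prime
    ds = sum((x - y) ** 2 for x, y in zip(s, s2))
    da = sum((x - y) ** 2 for x, y in zip(a, a2))
    return math.exp(-ds / (2.0 * sigma_s ** 2) - da / (2.0 * sigma_a ** 2))


def nearest_indices(sa, k, demos):
    """Indices of the k demonstrated pairs closest to sa.

    Assumption: closeness is Euclidean distance on the concatenated
    state-action tuple.
    """
    query = sa[0] + sa[1]

    def dist2(other):
        return sum((x - y) ** 2 for x, y in zip(query, other[0] + other[1]))

    return sorted(range(len(demos)), key=lambda i: dist2(demos[i]))[:k]


def extend_reward(sa, k, demos, rewards, sigma_s, sigma_a):
    """Kernel-weighted average of the rewards of the k nearest demonstrated pairs."""
    idx = nearest_indices(sa, k, demos)
    weights = [gaussian_kernel(sa, demos[i], sigma_s, sigma_a) for i in idx]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed: fall back to the plain neighbor mean.
        return sum(rewards[i] for i in idx) / len(idx)
    return sum(w * rewards[i] for w, i in zip(weights, idx)) / total


# Hypothetical demonstrations: ((position, speed), (action,)) with recovered rewards.
demos = [((0.0, 0.0), (1.0,)), ((1.0, 0.0), (0.0,))]
rewards = [2.0, 4.0]
midpoint_reward = extend_reward(((0.5, 0.0), (0.5,)), 2, demos, rewards, 1.0, 1.0)
```

A query that coincides with a demonstrated pair and uses $k=1$ returns that pair's recovered reward unchanged, so the extension agrees with the recovered reward on visited pairs in that case.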
We also penalize the reward function for non-visited state-action pairs that are far from the visited ones. The penalization term is obtained using a KNN regression with a Gaussian kernel for a state-action occupancy measure:
$${R}^{\text{penalized}}(s,a)=\alpha {\overline{R}}^{\text{nonpenalized}}(s,a)+(1-\alpha )\overline{p}(s,a),$$ 
where ${\overline{R}}^{\text{nonpenalized}}$ is the scaled reward within the interval, $[0,1]$, and $\overline{p}$ is computed as:
$$\overline{p}(s,a)=\frac{{\sum}_{({s}^{\prime},{a}^{\prime})\in \text{KNN}((s,a),k,\mathcal{D})}\mathcal{K}((s,a),({s}^{\prime},{a}^{\prime}))}{{\mathrm{max}}_{({s}^{\prime \prime},{a}^{\prime \prime})\in \mathcal{D}}{\sum}_{({s}^{\prime},{a}^{\prime})\in \text{KNN}(({s}^{\prime \prime},{a}^{\prime \prime}),k,\mathcal{D})}\mathcal{K}(({s}^{\prime \prime},{a}^{\prime \prime}),({s}^{\prime},{a}^{\prime}))}.$$ 
These reward extensions and penalties are based on [17].
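The penalization can be sketched in the same spirit. To keep the example short, it assumes flat state-action vectors and a single shared bandwidth instead of separate ${\sigma}_{S}$ and ${\sigma}_{A}$; under that simplification, the $k$ nearest pairs are exactly those with the $k$ largest kernel values. Min-max scaling is assumed for mapping the reward to $[0,1]$, and all names are hypothetical.

```python
import math


def kernel(x, y, bandwidth):
    """Isotropic Gaussian kernel on flat state-action vectors (a simplification)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * bandwidth ** 2))


def knn_kernel_mass(x, k, demos, bandwidth):
    """Sum of kernel values between x and its k nearest demonstrated pairs.

    With one shared bandwidth, the k nearest pairs have the k largest kernel values.
    """
    values = sorted((kernel(x, d, bandwidth) for d in demos), reverse=True)
    return sum(values[:k])


def occupancy_score(x, k, demos, bandwidth):
    """p_bar(s, a): kernel mass at x divided by the largest mass over the demos."""
    max_mass = max(knn_kernel_mass(d, k, demos, bandwidth) for d in demos)
    return knn_kernel_mass(x, k, demos, bandwidth) / max_mass


def min_max_scale(values):
    """Rescale rewards to [0, 1]; min-max scaling is an assumption of this sketch."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def penalized_reward(scaled_reward, occupancy, alpha):
    """alpha * R_bar_nonpenalized + (1 - alpha) * p_bar."""
    return alpha * scaled_reward + (1.0 - alpha) * occupancy


# Hypothetical demonstrated (position, speed, action) vectors.
demos = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
far_pair = (10.0, 10.0, 10.0)
far_score = occupancy_score(far_pair, 2, demos, 1.0)  # close to zero
far_reward = penalized_reward(1.0, far_score, 0.5)    # high reward, heavily discounted
```

A pair far from every demonstration gets an occupancy score near zero, so even a maximal scaled reward is reduced to roughly $\alpha$.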
The recovered rewards are obtained from expert demonstrations with different levels of noise, $\epsilon$. We repeated the evaluation over 10 runs. As shown in Figure 4, FQI with the recovered reward function outperforms the original reward in terms of convergence speed, regardless of noise level. When $\epsilon =0$, Option CRIRL converges to the optimal policy in only one iteration. As the noise level $\epsilon$ increases, BC yields worse performance, whereas Option CRIRL remains robust to noise.
Figure 5 displays the trajectories of the expert’s policy, the BC policy, and the policy computed via FQI with the reward recovered by Option CRIRL. When $\epsilon =0$, the trajectories almost overlap. As $\epsilon$ increases, BC requires more steps to reach the termination state, and some BC trajectories cannot finish the task properly. On the other hand, we see that our reward function can recover the optimal policy, even if expert demonstrations are not close to optimal.
8 Discussion
We developed a model-free IRL algorithm for hierarchical tasks modeled in the options framework. Our algorithm, Option CRIRL, extracts reward features using first-order optimality conditions based on the gradient for intra-option policies and termination functions. Then, it constructs task-wise reward functions from the extracted reward spaces using a second-order optimality condition. The recovered reward functions explain the expert’s behavior and the underlying hierarchical structure.
Most IRL algorithms require hand-crafted reward features, which are crucial to the quality of recovered reward functions. Our algorithm directly builds the approximate space of the reward function from expert demonstrations. Additionally, unlike other IRL methods, our algorithm does not require solving a forward problem as an inner step.
Some heuristic methods were used to solve the multi-objective optimization problem in the reward selection step. We used the weighted solution obtained from two separate single-objective optimization problems, empirically finding that any combination of weights resulted in good performance. Generally, depending on the type of option used, the parameters of either the intra-option policies or the termination functions could be more sensitive than the other. Therefore, the choice of weights can make a difference in the final performance. Additionally, we tested the linear scalarization approach, and our algorithm performed well, except for the case of the Taxi domain with the user-defined option, ${\mathrm{\Omega}}_{1}$. In this case, we found that the two trace vectors computed with the policies and terminations were too different in magnitude. Thus, the alternative approach was necessary.
Our algorithm was validated in several classical benchmark domains, but to apply it to real-world problems, we need to experiment with more complex environments. More sophisticated options should be used to better explain the complex nature of a hierarchical task, making experiment extensions easier.
9 Acknowledgement
Hyung Ju Hwang was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1E1A1A03070105, NRF-2019R1A5A1028324) and by the Institute for Information & Communications Technology Promotion (IITP) under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01441).
References
 [1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
 [2] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [3] Jaedeug Choi and Kee-Eung Kim. Nonparametric bayesian inverse reinforcement learning for multiple reward functions. In Advances in Neural Information Processing Systems, pages 305–313, 2012.
 [4] Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
 [5] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
 [6] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
 [7] Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
 [8] Thomas Furmston and David Barber. A unifying perspective of parametric policy search methods for markov decision processes. In Advances in neural information processing systems, pages 2717–2725, 2012.
 [9] Peter Henderson, Wei-Di Chang, Pierre-Luc Bacon, David Meger, Joelle Pineau, and Doina Precup. Optiongan: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [10] Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
 [11] George Konidaris, Scott Kuindersma, Roderic Grupen, and Andrew Barto. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research, 31(3):360–375, 2012.
 [12] Sanjay Krishnan, Animesh Garg, Richard Liaw, Lauren Miller, Florian T Pokorny, and Ken Goldberg. Hirl: Hierarchical inverse reinforcement learning for long-horizon tasks with delayed rewards. arXiv preprint arXiv:1604.06508, 2016.
 [13] Sanjay Krishnan, Animesh Garg, Richard Liaw, Brijen Thananjeyan, Lauren Miller, Florian T Pokorny, and Ken Goldberg. Swirl: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards. The International Journal of Robotics Research, 38(2-3):126–145, 2019.
 [14] Sanjay Krishnan, Animesh Garg, Sachin Patil, Colin Lea, Gregory Hager, Pieter Abbeel, and Ken Goldberg. Transition state clustering: Unsupervised surgical trajectory segmentation for robot learning. In Robotics Research, pages 91–110. Springer, 2018.
 [15] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.
 [16] Richard Liaw, Sanjay Krishnan, Animesh Garg, Daniel Crankshaw, Joseph E Gonzalez, and Ken Goldberg. Composing meta-policies for autonomous driving using hierarchical deep reinforcement learning. arXiv preprint arXiv:1711.01503, 2017.
 [17] Alberto Maria Metelli, Matteo Pirotta, and Marcello Restelli. Compatible reward inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 2050–2059, 2017.
 [18] Bernard Michini, Thomas Walsh, Aliakbar Aghamohammadi, and Jonathan P How. Bayesian nonparametric reward learning from demonstration. IEEE transactions on Robotics, 31:369–386, 2015.
 [19] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
 [20] Pierre Sermanet, Kelvin Xu, and Sergey Levine. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699, 2016.
 [21] Alec Solway, Carlos Diuk, Natalia Córdova, Debbie Yee, Andrew G Barto, Yael Niv, and Matthew M Botvinick. Optimal behavioral hierarchy. PLoS computational biology, 10(8):e1003779, 2014.
 [22] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 [23] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
 [24] Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Twenty-Third AAAI Conference on Artificial Intelligence, volume 8, pages 1433–1438, 2008.