Option Compatible Reward Inverse Reinforcement Learning

  • 2019-11-07 02:29:58
  • Rakhoon Hwang, Hanjin Lee, Hyung Ju Hwang
  • 1

Abstract

Reinforcement learning with complex tasks is a challenging problem. Often,expert demonstrations of complex multitasking operations are required to trainagents. However, it is difficult to design a reward function for given complextasks. In this paper, we solve a hierarchical inverse reinforcement learning(IRL) problem within the framework of options. A gradient method forparametrized options is used to deduce a defining equation for the Q-featurespace, which leads to a reward feature space. Using a second-order optimalitycondition for option parameters, an optimal reward function is selected.Experimental results in both discrete and continuous domains confirm that oursegmented rewards provide a solution to the IRL problem for multitaskingoperations and show good performance and robustness against the noise createdby expert demonstrations.

 

Quick Read (beta)

Option Compatible Reward Inverse Reinforcement Learning

Rakhoon Hwang Department of Mathematics, Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea Hanjin Lee Global leadership school, Handong global university, Pohang, Republic of Korea Hyung Ju Hwang11 1 Correponding address: Department of Mathematics, Pohang University of Science and Technology (POSTECH), 37673, 77, Cheongam-ro, Nam-gu, Pohang, Gyeongsangbuk-do, Republic of Korea
E-mail address: [email protected]
Department of Mathematics, Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea
Abstract

Reinforcement learning with complex tasks is a challenging problem. Often, expert demonstrations of complex multitasking operations are required to train agents. However, it is difficult to design a reward function for given complex tasks. In this paper, we solve a hierarchical inverse reinforcement learning (IRL) problem within the framework of options. A gradient method for parametrized options is used to deduce a defining equation for the Q-feature space, which leads to a reward feature space. Using a second-order optimality condition for option parameters, an optimal reward function is selected. Experimental results in both discrete and continuous domains confirm that our segmented rewards provide a solution to the IRL problem for multitasking operations and show good performance and robustness against the noise created by expert demonstrations.

1 Introduction

The reinforcement-learning (RL) method seeks an optimal policy for a given reward function in a Markov decision process (MDP). There are several circumstances in which an agent can learn only from an expert demonstration, because it is difficult to prescribe a proper reward function for a given task. Inverse reinforcement learning (IRL) aims to find a reward function for which the expert’s policy is optimized. When the IRL method is applied to a large domain, the size of each trajectory of the required demonstration by the expert can be huge. There are also certain complex tasks that must be segmented into a sequence of sub-tasks (e.g., robotics of ubiquitous general-purpose automation ([11] [13]), robotic surgical procedure training ([7], [14]), hierarchical human behavior modeling [21], and autonomous driving [16]). For such complex tasks, a problem designer can decompose it hierarchically. Then an expert can easily demonstrate it at different levels of implementation.

Another challenge with the IRL method is the design of feature spaces that capture the structure of the reward functions. Linear models for reward functions have been used in existing IRL algorithms. However, nonlinear models have recently been introduced [15], [6], [17]. Exploring more general feature spaces for reward functions becomes necessary when expert intuition is insufficient for designing good features, including linear models. This problem raises concerns, such as in the robotics field [20].

Regarding the first aspect of our problem, several works considered the decomposition of underlying reward functions for given expert demonstrations in RL and IRL problems ([9], [3], [12]). For hierarchical IRL problems, most of works focus on how to perform segmentation on demonstrations of complex tasks and find suitable reward functions. For the IRL problem in options framework, option discovery should be first carried out as a segmentation process. Since our work focuses on hierarchical extensions of policy gradient based IRL algorithms, we assign options for each given domain instead of applying certain option discovery algorithms.

To simultaneously resolve the problems of segmentation and feature construction , we propose a new method that applies the option framework presented by [2] to the compatible reward inverse reinforcement learning (CR-IRL) [17], a recent work on generating a feature space of rewards. Our method is called Option CR-IRL. Previous works on the selection of proper reward functions for the IRL problem require design features that consider the environment of the problem. However, the CR-IRL algorithm directly provides a space of features from which compatible reward functions can be constructed.

The main contribution of our work comprises the following items.

  • New method of assigning task-wise reward functions for a hierarchical IRL problem is introduced. Although both OptionGAN [9] and our work use policy gradient methods as a common grounding component, the former work adopts GAN approach to solve the IRL problem while we construct an explicit defining equation for reward features.

  • For a multitask IRL problem, we simultaneously obtain feature spaces for sub-tasks. When a task-wise reward function is constructed by combining obtained features, they share the same parameters for combination. As a consequence, we can avoid the problem of task-wise feature design and perform a one-shot process of optimal parameter selection across the domain of entire task.

  • Our optional framework approach for solving the IRL problem in a multitask environment outperforms other algorithms that do not consider the multitask nature of the problem. While handling the termination of each subtask, introducing parameters to termination and subtask policy functions in the policy gradient framework allows us to perform fine tuning on subtask reward functions, providing a better reward selection. It also shows better robustness to noise created by expert policy than other algorithms without using a hierarchical learning framework. The noise robustness of our algorithm is enabled by a larger dimension of the feature space than that which is available to non-optional CR-IRL algorithms.

There are differences in several aspects between our work and some of recent works [9], [18] and [12] on segmentation of reward functions in IRL problems. [18] uses Bayesian nonparametric mixture models to simultaneously partition the demonstration and learn associated reward functions. It has an advantage in the case with domains in which subgoals of each subtask are definite. For such domains, a successful segmentation simply defines task-wise reward functions. However, our work allows for indefiniteness of subgoals for which an assignment of rewards is not simple. [12] focuses on segmentation using transitions defined as changes in local linearity about a kernel function. It assumes pre-designed features for reward functions. On the other hand, our method does not assume any pre-knowledge on feature spaces.

2 Preliminaries

Markov decision process The Markov decision process comprises the state space, S, the action space, A, the transition function, P:S×A(S[0,1]), and the reward function, R:S×A. A policy is a probability distribution, π:S×A[0,1], over actions conditioned on the states. The value of a policy is defined as Vπ(s)=𝔼π[t=0γtRt+1|s0=s], and the action-value function is Qπ(s,a)=𝔼π[t=0γtRt+1|s0=s,a0=a], where γ[0,1) is the discounted factor.

Policy Gradients Policy gradient methods [22] aim to optimize a parametrized policy, πθ, via the stochastic gradient ascent. In a discounted setting, the optimization of the expected γ-discounted return, ρ(θ,s0)=𝔼πθ[t=0γtRst|s0], is considered. It can be written as

ρ(θ,s0)=Sμπθ(s|s0)Aπθ(a|s)R(s|a)𝑑a𝑑s

where μπθ(s|s0)=t=0γtP(st=s|s0,πθ). The policy gradient theorem [22] states:

θρ(θ,s0)=SAμπθ(s|s0)θπθ(a|s)Qπθ(s,a)𝑑a𝑑s.

CR-IRL The CR-IRL method is an algorithm that generates a set of base functions spanning the subspace of reward functions that cause the policy gradient to vanish. As input, a parametric policy space, ΠΘ={πθ:θΘk}, and a set of trajectories from the expert policy, πE, are taken. It first builds the features, {ϕi}, of the action-value function, which cause the policy gradient to vanish. These features can be transformed into reward features, {ψi}, via the Bellman equation (model-based case) or reward-shaping [19](model-free). Then, a reward function that maximizes the expected return is chosen by enforcing a second-order optimality condition based on the policy Hessian [10], [8].

The options framework We use the options framework for the hierarchical learning. See [23] for details. A Markovian option, ωΩ, is a triple (Iω,πω,βω), where Iω is an initiation set, πω is an intra-option policy, and βω:S[0,1] is a termination function. Following [2], we consider the call-and-return option execution model in which a fixed trajectory of options is given. Let πω,θ denote the intra-option policy of option ω parametrized by θ and βω,ϑ, the termination function of the same option parametrized by ϑ.

Option-critic architecture [2] proposed a method of option discovery based on gradient descent applied to the expected discounted return, defined by

ρ(Ω,θ,ϑ,s0,ω0)=EΩ,θ,ϑ[t=0γtRt+1|s0,ω0].

The objective function used here depends on policy over options and the parameters for intra-option policies and termination functions. Its gradient with respect to these parameters is taken through the following equations: the option-value function can be written as

QΩ(s,ω)=aπω,θ(a|s)QU(s,ω,a)

where

QU(s,ω,a)=R(s,a)+γsP(s|s,a)U(ω,s)

is the action-value function for the state-option pair, and

U(ω,s)=(1-βω,ϑ(s))QΩ(s,ω)+βω,ϑ(s)VΩ(s)

is the option-value function upon arrival.

3 Generation of Q-features compatible with the optimal policy

The first step to obtain a reward function as a solution for a given IRL problem is to generate Q-features compatible with an expert policy using the gradient of expected discounted returns. We assume that the parametrized expert intra-option policies, πω,θ, are differentiable with respect to θ. By the intra-option policy gradient theorem [2], the gradient of the expected discounted return with respect to θ vanishes as in the following equation:

θρ=s,ωμΩ(s,ω|s0,ω0)aθπω,θ(a|s)QU(s,ω,a)=0 (1)

where μΩ(s,ω|s0,ω0) is the occupancy measure of state-option pairs.

The first-order optimality condition, θρ=0, gives a defining equation for Q-features compatible with the optimal policy. It is convenient to define a subspace of such compatible Q-features in the Hilbert space of functions on Ω×S×A. We define the inner product:

<f,gω,s,af(ω,s,a)μΩ(s,ω|s0,ω0)πω,θ(a|s)g(ω,s,a).

Consider the subspace, Gπ={θlogπω,θα:αk}, of the Hilbert space of functions on Ω×S×A with the inner product defined above. Then, the space of Q-features can be represented by the orthogonal complement, Gπ of Gπ.

Parametrization of terminations is expected to allow us to have more finely tuned option-wise reward functions in IRL problems. We can impose an additional optimality condition on the expected discounted return with respect to parameters of the termination function. Let

ρ^(Ω,θ,ϑ,s1,ω0)=EΩ,θ,ϑ[t=0γtRt+1|ω0,s1]

be the expected discounted return with initial condition (s1,ω0). By the termination gradient theorem [2], one has

ϑρ^=-s,ωμΩ(s,ω|s1,ω0)ϑβω,ϑ(s)AΩ(s,ω) (2)

where AΩ is the advantage function over options AΩ(s,ω)=QΩ(s,ω)-VΩ(s).

The vanishing equation (2) gives a constraint on the space of the Q-feature, Gπ. For simplicity, set μ1,Ω(s,ω)=μΩ(s,ω|s1,ω0). The constraint equation for Gπ is given by

ω,sϑβω,ϑ(s)μ1,Ω(s,ω)(QΩ(s,ω)-ωπΩ(ω|s)QΩ(s,ω))=0 (3)

where

QΩ(s,ω)=aπω,θ(a|s)QU(s,ω,a).

Thus, we can combine two linear equations (1), (3) for QU to define the space of Q-features.

Let k be the size of parameters θ and let p be the size of parameters ϑ. Because aπω,θ(a|s)=1 for each pair (s,ω), the rank of the equation (1) is not greater than min{k,|S||A||Ω|-|S||Ω|}, and the rank of the equation (3) is not greater than min{p,|S||A||Ω|-|S||Ω|}. Thus, the dimension of the space of Q-features defined by (1) and (3) is not less than max{|S||Ω|,|S||A||Ω|-(k+p)}.

4 Reward function from Q-functions

For the MDP model, if two reward functions share the same optimal policy, then they satisfy the following([19]):

R(s,a)=R(s,a)+γsP(s|s,a)χ(s)-χ(s).

If we take χ=V, then

R(s,a)=Q(s,a)-V(s)=Q(s,a)-aπ(a|s)Q(s,a).

Because the Q-value function depends on the option in the options framework, the potential function, χ, also depends on the option. We thus need to consider reward-shaping with regards to the intra-option policy, πω. Then, the reward functions also need to be defined in the intra-option sense. Thus, reward functions also depend on options. This viewpoint is essential to our work and is similar to the approach taken in [9], in which Rω, the reward option, was introduced corresponding to the intra-option policy, πω. Reward functions, Rω,Rω, sharing the same intra-option policy, πω, satisfy

Rω(s,a)=Rω(s,a)+γsP(s|s,a)χ(s,ω)-χ(s,ω).

If we take χ(s,ω)=U(ω,s), then

Rω(s,a) = Rω(s,a)+γsP(s|s,a)U(ω,s)-U(ω,s)
= QU(s,ω,a)-[(1-β(s))QΩ(ω,s)+β(s)VΩ(s)]
= QU(s,ω,a)-aπω(a|s)QU(s,ω,a)+β(s)AΩ(s,ω)

This provides us with a way to generate reward functions from Q-features in the options framework.

5 Reward selection via the second-order optimality condition

Among the linear combinations of reward features constructed in the previous section, selecting a linear combination that maximizes ρ(θ) and ρ^(ϑ) is required. For the purpose of optimization, we use the second-order optimality condition based on the Hessian of ρ(θ) and ρ^(ϑ).

Consider a trajectory, τ=((s0,ω0,a0,b0),,(sT-1,ωT-1,aT-1,bT-1)), with termination indicator bt and terminal state sT. The termination indicator, bt, is 1 if a previous option terminates at step t. Otherwise it is 0. The probability density of trajectory τ is given by

θ,ϑ(τ)=p0(s0)δb0=1πΩ(ω0|s0)t=1T-1(bt,ωt|ωt-1,st)t=0T-1πωt(at|st)p(st+1|st,at),

where

(bt=1,ωt|ωt-1,st) =βωt-1(st)πΩ(ωt|st)
(bt=0,ωt|ωt-1,st) =(1-βωt-1(st))δωt=ωt-1.

We denote the set of all possible trajectories by 𝕋 and the γ-discounted trajectory reward by R(τ)=t=0T(τ)γtR(sτ,t,aτ,t). Then, the objective function can be rewritten as

ρ(Ω,θ,ϑ,s0,ω0)=E[t=0γtRt+1|s0,ω0]=𝕋θ,ϑ(τ)R(τ)𝑑τ.

Its gradient and Hessian with respect to θ can be expressed as

θρ=𝕋θ,ϑ(τ)θlogθ,ϑ(τ)R(τ)𝑑τ

and

θρ=𝕋θ,ϑ(τ)(θlogθ,ϑ(τ)θlogθ,ϑ(τ)T+θlogθ,ϑ(τ))R(τ)𝑑τ.

The second objective function can be written as

ρ^=E[t=0γtRt+1|ω0,s1]=𝕋^^θ,ϑ(τ^)R(τ^)𝑑τ^,

where τ^ is a trajectory beginning with (ω0,s1) with the probability distribution

^θ,ϑ(τ^)=p0(ω0,s1)t=1T-1(bt,ωt|ωt-1,st)t=1T-1πωt(at|st)p(st+1|st,at).

Then, its Hessian can be written as

ϑρ^=𝕋^^θ,ϑ(τ^)(ϑlog^θ,ϑ(τ^)ϑlog^θ,ϑ(τ^)T+ϑlog^θ,ϑ(τ^))R(τ^)𝑑τ^.

Let {ψω,i} be the reward features constructed from the previous section. Rewrite each Hessian as θ(ρ)=iwiθρi, where ρi is the expected return with respect to θ,ϑ for the reward function, ψi, and as ϑ(ρ^)=iwiϑρ^i, where ρ^i is the expected return with respect to ^θ,ϑ for the reward function, ψi. Set trθ,i=tr(θ(ρi)) and trϑ,i=tr(ϑ(ρ^i)) for i=1,,p.

We want to determine the reward weight, 𝐰, for the reward function, Rω=i=1pwiψω,i, which yields a negative definite Hessian with a minimal trace. Geometrically, this corresponds to choosing a reward function that makes the expected returns locally the sharpest maximum points. Here, to relieve a computational burden, we exploit a heuristic method suggested by [17].

Thus, we only choose reward features having negative definite Hessians, compute the trace of each Hessian, and collect them in the vectors 𝐭𝐫θ=(trθ,i) and 𝐭𝐫ϑ=(trϑ,i). We determine w by solving

min𝐰𝐰T𝐭𝐫θ, andmin𝐰𝐰T𝐭𝐫ϑ  s. t.   𝐰22=1.

Typically, multi-objective optimization problems have no single solutions that optimize all objective functions simultaneously. One well-known approach to tackling this problem is a linear scalarization. Thus, we consider the following single-objective problem:

min𝐰λθ𝐰T𝐭𝐫θ+λϑ𝐰T𝐭𝐫ϑ  s. t.   𝐰22=1

with positive weights λθ and λϑ. A closed-form solution is computed as 𝐰=-(λθ𝐰T𝐭𝐫θ+λϑ𝐰T𝐭𝐫ϑ)/λθ𝐰T𝐭𝐫θ+λϑ𝐰T𝐭𝐫ϑ. With a different choice of scalarization weights, λθ and λϑ, different reward functions can be produced. In practice, it is difficult to choose a proper weight pair, because the gap between the magnitudes of two trace vectors can be too large. Alternatively, we consider a weighted sum of solutions to the two single-objective optimization problems: 𝐰=αθ(-𝐭𝐫θ/𝐭𝐫θ)+αϑ(-𝐭𝐫ϑ/𝐭𝐫ϑ), satisfying αθ,αϑ>0 and αθ2+αϑ2=1. Thus, we cannot guarantee the obtained solution is Pareto optimal.

6 Algorithm

We summarize our algorithm of solving the IRL problem in the options framework as follows:

{algorithm}

Option Compatible Reward IRL \[email protected]@algorithmic[1]

\REQUIRE

(1) Expert’s demo-trajectories D=i=1N{(ωτi,1,sτi,0,aτi,0),(ωτi,1,sτi,1,aτi,1),,(ωτi,T(τi),sτi,T(τi),aτi,T(τi)}, (2) Option Ωθ,ϑ={ω}, for which expert’s policies, πω,θ, and termination functions, βω,ϑ, are parametrized, and a policy πΩ over options. \ENSUREReward function Rω(s,a).

Phase 1 \STATEEstimate dμπθ(ω,s,a)=μΩ(s,ω|s0,ω0)πω,θ(a|s) for the visited state-action pairs. \STATEGet the |D|×|D| matrix Dμπθ= diag (dμπθ) and k×|D| matrix θlogπω(a|s). \STATECompute the null space of θlogπωTDμπθ. \STATEEstimate μ1=μΩ(s,ω|s1,ω0) for the visited state-action pairs. \STATEGet the matrices, diag(μ1), ϑβ, ΠΩ, and Π. \STATECompute the null space of ϑβTdiag(μ1)(I-ΠΩ)Π. \STATEFind the intersection, Φ, of two null spaces. \STATEGet the set of advantage functions using A=(I-ΠΩ)ΠΦ.

Phase 2 \STATEGet the set of reward functions by applying reward shaping Ψ=(I-Π)Φ+βA. \STATEApply singular value decomposition to orthogonalize Ψ.

Phase 3 \STATEEstimate the Hessian, ^θρi and ^ϑρ^i, for each reward feature, ψi, i=1,p\STATEDiscard the reward feature having an indefinite Hessian; switch sign for those having positive semi-definite Hessian; and compute trθ,i=tr(^θ(ρi)) and trϑ,i=tr(^ϑ(ρ^i)) for i=1,,p \STATEReward function Rω=Ψ𝐰ω, 𝐰ω=-(1/2)(𝐭𝐫θ/𝐭𝐫θ+𝐭𝐫ϑ/𝐭𝐫ϑ)

Our algorithm consists of three phases. In the first phase, we obtain basis for Q-features space by solving linear equations. Linear equations consist of two parts. The first part is defined by the gradient of logarithmic policy and the second part is defined by the gradient of option termination. The matrices ΠΩ and Π are introduced to carry out computation for the second part. The matrix ΠΩ is the row repetition of policy over option, πΩ, on visited option and state pair. The matrix Π is a block diagonal where each entry is intra-option policy over visited state and action pair for each option.

In the second phase, we obtain basis for reward-features using reward shaping for option. In the last phase, we select the definite reward by applying Hessian test to two objective functions.

Our algorithm can be naturally extended to continuous states and action spaces. In the continuous domains we use a k-nearest neighbors method to extend recovered reward functions to non-visited state-action pairs. Additional penalization terms can be included. Details about implementation are presented in section 7.2.

7 Experiment

7.1 Taxi

Figure 1: Taxi domain

We test Option CR-IRL in the Taxi domain defined in [4]. The environment is a 5×5 grid world (Figure 1) having four landmarks, L={1,2,3,4}. The state vector, s=(x,y,p,d), comprises a current position, (x,y), of a taxi, a location, pL{taxi}, of a passenger, and the passenger’s destination, dL. The possible actions are movements in four directions, including pick-up and drop-off actions. In each episode, positions of taxi, passenger, and destination are assigned to random landmarks. The task of the driver is to pick up the passenger and drop him off at the destination. Depending on where the passenger or destination is located, each subtask involves navigating to one of the four landmarks. We define the option, Ω1={(Iω,πω,βω)|ωL}, to have each landmark subgoal, where

{Iω:p=ωor(p=taxiandd=ω)πω:the policy for getting to the landmarkωβω:1if(x,y)is the location ofω;0otherwise.

The intra-option policy, πω, can be induced from the expert policy computed from the value iteration by duplicating the policy parameters relevant to its subgoal. The policy, πΩ, over options is deterministic, because a subgoal can be directly specified from the current state. This user-defined option captures the hierarchical structure of the Taxi environment. On the other hand, we further evaluate our algorithm using option Ω2, discovered by [2].

Each intra-option policy is parametrized as an ϵ-Boltzmann policy:

πθ,ω(a|s)=(1-ϵ)eθω,aTζsaAeθω,aTζs+ϵ|A|,

where the policy features, ζs, are the state features defined by current location, passenger location, destination location, and whether the passenger has been picked up. The noise, ϵ, is included to evaluate the robustness to imperfect experts. Termination probability, βϑ, is parametrized as a sigmoid function.

For comparison, we give weights to the option-wise reward function, Rω(s,a), based on the policy over options:

R(s,a)=ωΩπΩ(ω|s)Rω(s,a).

It is easy to compare against other IRL algorithms by combining the rewards assigned to each option while the modified reward R(s,a) maintains the nature of each task. We evaluate Option CR-IRL against behavior cloning (BC), maximum entropy IRL (ME-IRL) [24], and linear programming apprenticeship learning (LPAL) [1]. A natural choice for a reward feature in ME-IRL and LPAL is the policy feature, ζs, defined above. Figures 2 and 3 show the results of training a Boltzmann policy using REINFORCE, coped with the recovered reward function and the user-defined option, Ω1, and the discovered option, Ω2, respectively. Each result is averaged over 10 repetitions, and the error bars correspond to the standard deviation. We see that Option CR-IRL converges faster to the optimal policy than does the original reward function and ME-IRL when the user-defined option is used. When the discovered option is used, ME-IRL often fails to learn the optimal policy. However, BC and LPAL are very sensitive to noise, whereas our algorithm is not significantly affected by a noise level. The noise robustness of our algorithm can be explained by the larger dimension of the feature space over non-optional CR-IRL algorithms. As explained at the end of Section 3, the dimension of the space of Q-features increases by factor, |Ω|, which is the size of options, and can absorb the change in defining the equation of Q-features.

Figure 2: Average return of Taxi domain as a function of the number of iterations of REINFORCE, for the user-defined option, Ω1
Figure 3: Average return of Taxi domain as a function of the number of iterations of REINFORCE, for the discovered option, Ω2

7.2 Car on the Hill

We test Option CR-IRL in the continuous Car-on-the-Hill domain [5]. A car traveling on a hill is required to reach the top of the hill. Here, the shape of the hill is given by the function, Hill(p):

Hill(p)={p2+pifp<0p1+5p2ifp0.

The state space is continuous with dimension two: position p and speed v of the car with p[-1,1] and v[-3,3]. The action a[-4,4] acts on the car’s acceleration. The reward function, R(p,v,a), is defined as:

R(pt,vt,at)={-1ifpt+1<-1or|vt+1|>31ifpt+1>1and|vt+1|30otherwise.

The discount factor, γ, is 0.95, and the initial state is p0=-0.5 with v0=0.

Because the car engine is not strong enough, simply accelerating up the slope cannot make it to the desired goal. The entire task can be divided into two subtasks: reaching enough speed at the bottom of the valley to leverage potential energy (subgoal 1), and driving to the top (subgoal 2). To evaluate our algorithm, we introduce hand-crafted options:

{Iω:the state spaceSπω:the policy for subgoalωβω:1if the agent achieves the subgoal;0otherwise

for ω{1,2}. Intra-option policy πω is defined by approximating the deterministic intra-option policies, πω,FQI, via the fitted-Q iteration (FQI) [5] with the two corresponding small MDPs. We consider noisy intra-option policies in which a random action is selected with probability ϵ:

πω(a|s)=(1-ϵ)πω,FQI(a|s)+ϵπrandom(a|s)

for each option, ω. Each intra-option policy is parametrized as Gaussian policy πθ,ω(a|s)𝒩(yθ,ω(s),σ2), where σ2 is fixed to be 0.01, and yθ,ω(s) is obtained using radial basis functions:

yθ,ω(s)=k=1Nθω,ke-δs-sk2,

with uniform grids, sk, in the state space. The parameter, θω, is estimated using 20 expert trajectories for each option. Termination probability, βϑ,ω, is parametrized as a sigmoid function.

For comparison, the task-wise reward function, Rω(s,a), is merged into one reward, R(s,a), by omitting the option term. This modification is possible, because the policy-over-options is deterministic in our setting. The merged reward function, R(s,a), can be compared with other reward functions using a non-hierarchical RL algorithm. We extend the recovered reward function to non-visited state-action pairs using a kernel k-nearest neighbors (KNN) regression with a Gaussian kernel:

Rnon-penalized(s,a)=(s,a)KNN((s,a),k,𝒟)𝒦((s,a),(s,a))R(s,a)(s,a)KNN((s,a),k,𝒟)𝒦((s,a),(s,a))

where KNN((s,a),k,𝒟) is the set of the k nearest state-action pairs in the demonstrations, 𝒟, and 𝒦 is a Gaussian kernel over S×A:

𝒦((s,a),(s,a))=exp(-12σS2s-s2-12σA2a-a2).

We also penalize the reward function for a non-visited state-action pairs far from the visited one. The penalization term is obtained using a KNN regression with a Gaussian kernel for a state-action occupancy measure:

Rpenalized(s,a)=αR¯non-penalized(s,a)+(1-α)p¯(s,a),

where R¯non-penalized is the scaled reward within the interval, [0,1], and p¯ is computed as:

p¯(s,a)=(s,a)KNN((s,a),k,𝒟)𝒦((s,a),(s,a))max(s′′,a′′)𝒟(s,a)KNN((s′′,a′′),k,𝒟)𝒦((s′′,a′′),(s,a)).

These reward extensions and penalties are based on [17].

The recovered rewards are obtained from expert demonstrations with different levels of noise, ϵ. We repeated the evaluation over 10 runs. As shown in Figure 4, FQI with the reward function outperforms the original reward in terms of convergence speed, regardless of noise level. When ϵ=0, Option CR-IRL converges to the optimal policy in only one iteration. As the noise level ϵ increases, BC yields worse performance, whereas Option CR-IRL is still robust to noise.

Figure 5 displays the trajectories of the expert’s policy, the BC policy, and the policy computed via FQI with the reward recovered by Option CR-IRL. When ϵ=0, trajectories are almost overlapping. When ϵ increases, BC requires more steps to reach to the termination state, and some cannot finish the task properly. On the other hand, we see that our reward function can recover the optimal policy, even if expert demonstrations are not close to optimal.

Figure 4: Average return of Car-on-the-Hill domain as a function of the number of FQI iterations.
Figure 5: Trajectories of the expert’s policy, the BC policy, and the policy computed via FQI with the reward recovered by Option CR-IRL.

8 Discussion

We developed a model-free IRL algorithm for hierarchical tasks modeled in the options framework. Our algorithm, Option CR-IRL, extracts reward features using first-order optimality conditions based on the gradient for intra-option policies and termination functions. Then, it constructs task-wise reward functions from the extracted reward spaces using a second-order optimality condition. The recovered reward functions explain the expert’s behavior and the underlying hierarchical structure.

Most IRL algorithms require hand-crafted reward features, which are crucial to the quality of recovered reward functions. Our algorithm directly builds the approximate space of the reward function from expert demonstrations. Additionally, unlike other IRL methods, our algorithm does not require solving a forward problem as an inner step.

Some heuristic methods were used to solve the multi-objective optimization problem in the reward selection step. We used the weighted solution obtained from two separate single-objective optimization problems, empirically finding that any combination of weights resulted in good performances. Generally, depending on the type of option used, one of parameters of intra-option policies or termination functions could be more sensitive than the other. Therefore, the choice of weights can make a difference in the final performance. Additionally, we tested the linear scalarization approach, and our algorithm performed well, except for the case of the Taxi domain with the user-defined option, Ω1. In this case, we found that two trace vectors computed with the policies and terminations were too different in magnitude. Thus, the alternative approach was inevitable.

Our algorithm was validated in several classical benchmark domains, but to apply it to real-world problems, we need to experiment with more complex environments. More sophisticated options should be used to better explain the complex nature of a hierarchical task, making experiment extensions easier.

9 Acknowledgement

Hyung Ju Hwang was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1E1A1A03070105, NRF-2019R1A5A1028324) and by the Institute for Information & Communications Technology Promotion (IITP) under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01441).

References

  • [1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
  • [2] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [3] Jaedeug Choi and Kee-Eung Kim. Nonparametric bayesian inverse reinforcement learning for multiple reward functions. In Advances in Neural Information Processing Systems, pages 305–313, 2012.
  • [4] Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
  • [5] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
  • [6] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
  • [7] Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
  • [8] Thomas Furmston and David Barber. A unifying perspective of parametric policy search methods for markov decision processes. In Advances in neural information processing systems, pages 2717–2725, 2012.
  • [9] Peter Henderson, Wei-Di Chang, Pierre-Luc Bacon, David Meger, Joelle Pineau, and Doina Precup. Optiongan: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [10] Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
  • [11] George Konidaris, Scott Kuindersma, Roderic Grupen, and Andrew Barto. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research, 31(3):360–375, 2012.
  • [12] Sanjay Krishnan, Animesh Garg, Richard Liaw, Lauren Miller, Florian T Pokorny, and Ken Goldberg. Hirl: Hierarchical inverse reinforcement learning for long-horizon tasks with delayed rewards. arXiv preprint arXiv:1604.06508, 2016.
  • [13] Sanjay Krishnan, Animesh Garg, Richard Liaw, Brijen Thananjeyan, Lauren Miller, Florian T Pokorny, and Ken Goldberg. Swirl: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards. The International Journal of Robotics Research, 38(2-3):126–145, 2019.
  • [14] Sanjay Krishnan, Animesh Garg, Sachin Patil, Colin Lea, Gregory Hager, Pieter Abbeel, and Ken Goldberg. Transition state clustering: Unsupervised surgical trajectory segmentation for robot learning. In Robotics Research, pages 91–110. Springer, 2018.
  • [15] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.
  • [16] Richard Liaw, Sanjay Krishnan, Animesh Garg, Daniel Crankshaw, Joseph E Gonzalez, and Ken Goldberg. Composing meta-policies for autonomous driving using hierarchical deep reinforcement learning. arXiv preprint arXiv:1711.01503, 2017.
  • [17] Alberto Maria Metelli, Matteo Pirotta, and Marcello Restelli. Compatible reward inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 2050–2059, 2017.
  • [18] Bernard Michini, Thomas Walsh, Ali-akbar Agha-mohammadi, and Jonathan P How. Bayesian nonparametric reward learning from demonstration. IEEE transactions on Robotics, 31:369–386, 2015.
  • [19] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
  • [20] Pierre Sermanet, Kelvin Xu, and Sergey Levine. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699, 2016.
  • [21] Alec Solway, Carlos Diuk, Natalia Córdova, Debbie Yee, Andrew G Barto, Yael Niv, and Matthew M Botvinick. Optimal behavioral hierarchy. PLoS computational biology, 10(8):e1003779, 2014.
  • [22] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • [23] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
  • [24] Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Twenty-Third AAAI Conference on Artificial Intelligence, volume 8, pages 1433–1438, 2008.