Learner-aware Teaching: Inverse Reinforcement Learning with Preferences and Constraints

  • 2019-06-02 15:51:35
  • Sebastian Tschiatschek, Ahana Ghosh, Luis Haug, Rati Devidze, Adish Singla
  • 0

Abstract

Inverse reinforcement learning (IRL) enables an agent to learn complexbehavior by observing demonstrations from a (near-)optimal policy. The typicalassumption is that the learner's goal is to match the teacher's demonstratedbehavior. In this paper, we consider the setting where the learner has her ownpreferences that she additionally takes into consideration. These preferencescan for example capture behavioral biases, mismatched worldviews, or physicalconstraints. We study two teaching approaches: learner-agnostic teaching, wherethe teacher provides demonstrations from an optimal policy ignoring thelearner's preferences, and learner-aware teaching, where the teacher accountsfor the learner's preferences. We design learner-aware teaching algorithms andshow that significant performance improvements can be achieved overlearner-agnostic teaching.

 

Quick Read (beta)

Learner-aware Teaching: Inverse Reinforcement Learning with Preferences and Constraints

Sebastian Tschiatschek
Microsoft Research
[email protected] &Ahana Ghosh11footnotemark: 1
MPI-SWS
[email protected] &Luis Haug11footnotemark: 1
ETH Zurich
[email protected] \ANDRati Devidze
MPI-SWS
[email protected] &Adish Singla
MPI-SWS
[email protected]
Authors contributed equally to this work.
Abstract

Inverse reinforcement learning (IRL) enables an agent to learn complex behavior by observing demonstrations from a (near-)optimal policy. The typical assumption is that the learner’s goal is to match the teacher’s demonstrated behavior. In this paper, we consider the setting where the learner has her own preferences that she additionally takes into consideration. These preferences can for example capture behavioral biases, mismatched worldviews, or physical constraints. We study two teaching approaches: learner-agnostic teaching, where the teacher provides demonstrations from an optimal policy ignoring the learner’s preferences, and learner-aware teaching, where the teacher accounts for the learner’s preferences. We design learner-aware teaching algorithms and show that significant performance improvements can be achieved over learner-agnostic teaching.

 

Learner-aware Teaching: Inverse Reinforcement Learning with Preferences and Constraints


  Sebastian Tschiatschekthanks: Authors contributed equally to this work. Microsoft Research [email protected] Ahana Ghosh11footnotemark: 1 MPI-SWS [email protected] Luis Haug11footnotemark: 1 ETH Zurich [email protected] Rati Devidze MPI-SWS [email protected] Adish Singla MPI-SWS [email protected]

\@float

noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

Inverse reinforcement learning (IRL) enables a learning agent (learner) to acquire skills from observations of a teacher’s demonstrations. The learner infers a reward function explaining the demonstrated behavior and optimizes her own behavior accordingly. IRL has been studied extensively [Abbeel and Ng, 2004, Ratliff et al., 2006, Ziebart, 2010, Boularias et al., 2011, Osa et al., 2018] under the premise that the learner can and is willing to imitate the teacher’s behavior.

In real-world settings, however, a learner typically does not blindly follow the teacher’s demonstrations, but also has her own preferences and constraints. For instance, consider demonstrating to an auto-pilot of a self-driving car how to quickly navigate from A to B by going through a pedestrian zone. These demonstrations might conflict with the constraint of the auto-pilot to drive only on lanes in order to ensure maximum safety of human beings. Similarly, in robot-human interaction with the goal of teaching people how to cook, a teaching robot might demonstrate to a human user how to cook “roast chicken”, which could conflict with the preferences of the learner who is “vegetarian”. To give yet another example, consider a surgical training simulator which provides virtual demonstrations of expert behavior; a novice learner might not be confident enough to imitate a difficult procedure because of safety concerns. In all these examples, the learner might not be able to acquire useful skills from the teacher’s demonstrations.

In this paper, we formalize the problem of teaching a learner with preferences and constraints. First, we are interested in understanding the suboptimality of learner-agnostic teaching, i.e., ignoring the learner’s preferences. Second, we are interested in designing learner-aware teachers who account for the learner’s preferences and thus enable more efficient learning. To this end, we study a learner model with preferences and constraints in the context of the Maximum Causal Entropy (MCE) IRL framework [Ziebart, 2010, Ziebart et al., 2013, Zhou et al., 2018]. This enables us to formulate the teaching problem as an optimization problem, and to derive and analyze algorithms for learner-aware teaching.

Our main contributions are:

  1. I

    We formalize the problem of IRL under preference constraints (Section 2 and Section 3).

  2. II

    We analyze the problem of optimizing demonstrations for the learner when preferences are known to the teacher, and we propose a bilevel optimization approach to the problem (Section 4).

  3. III

    We propose strategies for adaptively teaching a learner with preferences unknown to the teacher, and we provide theoretical guarantees under natural assumptions (Section 5).

  4. IV

    We empirically show that significant performance improvements can be achieved by learner-aware teachers as compared to learner-agnostic teachers (Section 6).

2 Problem Setting

Environment. Our environment is described by a Markov decision process (MDP) :=(𝒮,𝒜,T,γ,P0,R). Here 𝒮 and 𝒜 denote finite sets of states and actions. T:𝒮×𝒮×𝒜[0,1] describes the state transition dynamics, i.e., T(s|s,a) is the probability of landing in state s by taking action a from state s. γ(0,1) is the discounting factor. P0:𝒮[0,1] is an initial distribution over states. R:𝒮 is the reward function. We assume that there exists a feature map ϕr:𝒮[0,1]dr such that the reward function is linear, i.e., R(s)=𝐰r*,ϕr(s) for some 𝐰r*dr. Note that a bound of 𝐰r*11 ensures that |R(s)|1 for all s.

Basic definitions. A policy is a map π:𝒮×𝒜[0,1] such that π(s,) is a probability distribution over actions for every state s. We denote by Π the set of all such policies. The performance measure for policies we are interested in is the expected discounted reward R(π):=𝔼(t=0γtR(st)), where the expectation is taken with respect to the distribution over trajectories ξ=(s0,s1,s2,) induced by π together with the transition probabilities T and the initial state distribution P0. A policy π is optimal for the reward function R if πargmaxπΠR(π), and we denote optimal policies by π*. Note that R(π)=𝐰r*,μr(π), where μr:Πdr, π𝔼(t=0γtϕr(st)), is the map taking a policy to its vector of (discounted) feature expectations. We denote by Ωr={μr(π):πΠ} the image μr(Π) of this map. Note that the set Ωrdr is convex (see [Ziebart, 2010, Theorem 2.8] and [Abbeel and Ng, 2004]), and also bounded due to the discounting factor γ(0,1). For a finite collection of trajectories Ξ={s0i,s1i,s2i,}i=1,2, obtained by executing a policy π in the MDP , we denote the empirical counterpart of μr(π) by μ^r(Ξ):=1|Ξ|itγtϕr(sti).

An IRL learner and a teacher. We consider a learner 𝖫 implementing an inverse reinforcement learning (IRL) algorithm and a teacher 𝖳. The teacher has access to the full MDP ; the learner knows the MDP and the parametric form of reward function R(s)=𝐰r,ϕr(s) but does not know the true reward parameter 𝐰r*. The learner, upon receiving demonstrations from the teacher, outputs a policy π𝖫 using her algorithm. The teacher’s objective is to provide a set of demonstrations Ξ𝖳 to the learner that ensures that the learner’s output policy π𝖫 achieves high reward R(π𝖫).

The standard IRL algorithms are based on the idea of feature matching [Abbeel and Ng, 2004, Ziebart, 2010, Osa et al., 2018]: The learner’s algorithm finds a policy π𝖫 that matches the feature expectations of the received demonstrations, ensuring that μr(π𝖫)-μ^r(Ξ𝖳)2ϵ where ϵ specifies a desired level of accuracy. In this standard setting, the learner’s primary goal is to imitate the teacher (via feature matching) and this makes the teaching process easy. In fact, the teacher just needs to provide a sufficiently rich pool of demonstrations Ξ𝖳 obtained by executing π*, ensuring μ^r(Ξ𝖳)-μr(π*)2ϵ. This guarantees that μr(π𝖫)-μr(π*)22ϵ. Furthermore, the linearity of rewards and 𝐰r*11 ensures that the learner’s output policy π𝖫 satisfies R(π𝖫)R(π*)-2ϵ.

(a) Environment
(b) Set of μr(π) vectors
Figure 1: An illustrative example to showcase the suboptimality of teaching when the learner has preferences and constraints. Environment: Figure 0(a) shows a grid-world environment inspired by the object-world and gathering game environments [Levine et al., 2010, Leibo et al., 2017, Mendez et al., 2018]. Each cell represents a state, there are five actions given by “left", “up", “right", "down", “stay", the transitions are deterministic, and the starting state is the top-left cell. The agent’s goal is to collect objects in the environment: Collecting a “star" provides a reward of 1.0 and a “plus" a reward of 0.9; objects immediately appear again upon collection, and the rewards are discounted with γ close to 1. The optimal policy π* is to go to the nearest “star" and then “stay" there. Preferences: A small number of states in the environment are distractors, depicted by colored cells in Figure 0(a). We consider a learner who prefers to avoid “green" distractors: she has a hard constraint that the probability of having a “green" distractor within a 3x3 neighborhood is at most ϵ=0.1. Feature expectation vectors: Figure 0(b) shows the set of feature expectation vectors {μr(π):πΠ}. The x-axis and the y-axis represent the discounted feature count for collecting “star" and “plus" objects, respectively. The striped region represents policies that are feasible w.r.t. the learner’s constraint. Suboptimality of teaching: Upon receiving demonstrations from an optimal policy π* with feature vector μr(π*), the learner under her preference constraint can best match the teacher’s demonstrations (in a sense of minimizing μr(π𝖫)-μr(π*)2) by outputting a policy with μr(π2), which is clearly suboptimal w.r.t. the true rewards. Policy π3 with feature vector μr(π3) represents an alternate teaching policy which would have led to higher reward for the learner.

Key challenges in teaching a learner with preference constraints. In this paper, we study a novel setting where the learner has her own preferences which she additionally takes into consideration when learning a policy π𝖫 using teacher’s demonstrations. We formally specify our learner model in the next section; here we highlight the key challenges that arise in teaching such a learner. Given that the learner’s primary goal is no longer just imitating the teacher via feature matching, the learner’s output policy can be suboptimal with respect to the true reward even if she had access to μr(π*), i.e., the feature expectation vector of an optimal policy π*. Figure 1 provides an illustrative example to showcase the suboptimality of teaching when the learner has preferences and constraints. The key challenge that we address in this paper is that of designing a teaching algorithm that selects demonstrations while accounting for the learner’s preferences.

3 Learner Model

In this section we describe the learner models we consider, including different ways of defining preferences and constraints. First, we introduce some notation and definitions that will be helpful. We capture learner’s preferences via a feature map ϕc:𝒮[0,1]dc. We define ϕ(s) as a concatenation of the two feature maps ϕr(s) and ϕc(s) given by [ϕr(s),ϕc(s)] and let d=dr+dc. Similar to the map μr, we define μc:Πdc, π𝔼(t=0γtϕc(st)) and μ:Πd, π𝔼(t=0γtϕ(st)). Similar to Ωr, we define Ωcdc and Ωd as the images of the maps μc(Π) and μ(Π). Note that for any policy πΠ, we have μ(π)=[μr(π),μc(π)].

Standard (discounted) MCE-IRL. Our learner models build on the (discounted) Maximum Causal Entropy (MCE) IRL framework [Ziebart et al., 2008, Ziebart, 2010, Ziebart et al., 2013, Zhou et al., 2018]. In the standard (discounted) MCE-IRL framework, a learning agent aims to identify a policy that matches the feature expectations of the teacher’s demonstrations while simultaneously maximizing the (discounted) causal entropy given by H(π):=H({at}t=0,1,{st}t=0,1,):=t=0γt𝔼[-logπ(atst)]. More background is provided in Appendix C.

Including preference constraints. The standard framework can be readily extended to include learner’s preferences in the form of constraints on the preference features ϕc. Clearly, the learner’s preferences can render exact matching of the teacher’s demonstrations infeasible and hence we relax this condition. To this end, we consider the following generic learner model:

maxπ,δrsoft0,δcsoft0 H(π)-Crδrsoftp-Ccδcsoftp (1)
s.t. |μr(π)[i]-μ^r(Ξ𝖳)[i]|δrhard[i]+δrsoft[i]i{1,2,,dr}
gj(μc(π))δcsoft[j]j{1,2,,m}

Here, g:dc are m convex functions representing preference constraints. We denote the parameters and variables in vector notation as δrhard0dr, δrsoft0dr, and δcsoft0dc. The coefficients Cr and Cc are parameters and quantify the importance of matching the teacher’s demonstrations and satisfying the learner’s preferences. Next, we discuss two special instances of this generic learner model.

3.1 Learner Model with Hard Preferences Constraints

It is instructive to study a special case of the above-mentioned generic learner model with δrhard=0, and a limiting case with Cr,Cc0 and CcCr. Intuitively, the preferences take the form of hard constraints, i.e., the learner’s output policy must satisfy gj(μc(π))0j{1,2,,m}. Additionally, while satisfying these hard constraints, the learner minimizes the Lp norm distance to the teacher’s demonstration. We formally describe the learner’s behavior below.

First, we define the learner’s constraint set as Ω𝖫:={μ:μΩ s.t. gj(μc)0j{1,2,,m}}. Similar to Ω𝖫, we define Ωc𝖫Ωc and Ωr𝖫Ωr. Also, note that Ωr𝖫 and Ωc𝖫 are projections of the set Ω𝖫 to the subspaces dr and dc respectively. Then, the learner’s behavior can be approximated as:

  1. (i)

    Learner can match: When μ^r(Ξ𝖳)Ωr𝖫, the learner outputs a policy π𝖫 s.t. μr(π𝖫)=μ^r(Ξ𝖳).

  2. (ii)

    Learner cannot match: Otherwise, the learner outputs a policy π𝖫 such that μr(π𝖫) is given by the Lp norm projection of the vector μ^r(Ξ𝖳) onto the set Ωr𝖫.

Figure 1 provides an illustration of the behavior of this learner model. We will design learner-aware teaching algorithms for this learner model in Section 4.1 and Section 5.

3.2 Learner Model with Soft Preference Constraints

Another interesting learner model that we study in this paper arises from the generic learner when we consider m=dc number of box-type linear constraints with gj(μc(π))=μc(π)[j]-δchard[j]j{1,2,,dc}. We consider L1 norm penalty on violation, and for simplicity we consider δrhard[i]=0i{1,2,,dr}. In this case, the learner’s model is given by

maxπ,δrsoft,low0,δrsoft,up0,δcsoft,up0 H(π)-i=1drCr(δrsoft,low[i]+δrsoft,up[i])-j=1dcCcδcsoft,up[j] (2)
s.t. μ^r(Ξ𝖳)[i]-μr(π)[i]δrsoft,low[i]i{1,2,,dr}
μr(π)[i]-μ^r(Ξ𝖳)[i]δrsoft,up[i]i{1,2,,dr}
μc(π)[j]δchard[j]+δcsoft,up[j]j{1,2,,dc}

The solution to the above problem corresponds to a softmax policy with a reward function R𝝀(s)=𝒘𝝀,ϕ(s) where 𝒘𝝀d is parametrized by 𝝀. The optimal parameters 𝝀 can be computed efficiently and the corresponding softmax policy is then obtained by Soft-Value-Iteration procedure (see [Ziebart, 2010, Algorithm. 9.1], [Zhou et al., 2018]). Details are provided in Appendix D. We will design learner-aware teaching algorithms for this learner in Section 4.2.

4 Learner-aware Teaching under Known Constraints

In this section, we analyze the setting when the teacher has full knowledge of the learner’s constraints.

4.1 A Learner-aware Teacher for Hard Preferences: Aware-CMDP

Here, we design a learner-aware teaching algorithm when considering the learner from Section 3.1. Given that the teacher has full knowledge of the learner’s preferences, it can compute an optimal teaching policy by maximizing the reward over policies that satisfy the learner’s preference constraints, i.e., the teacher solves a constrained-MDP problem (see [De, 1960, Altman, 1999]) given by

maxπ 𝐰r*,μr(π)s.t.μr(π)Ωr𝖫.

We refer to an optimal solution of this problem as πaware and the corresponding teacher as Aware-CMDP. We can make the following observation formalizing the value of learner-aware teaching:

Theorem 1.

For simplicity, assume that the teacher can provide an exact feature expectation μ(π) of a policy instead of providing demonstrations to the learner. Then, the value of learner-aware teaching is

maxπ s.t. μr(π)Ωr𝖫𝐰r*,μr(π)-𝐰r*,ProjΩr𝖫(μr(π*))0.

When the set Ω is defined via a set of linear constraints, the above problem can be formulated as a linear program and solved exactly; details are provided in Appendix E.

4.2 A Learner-aware Teacher for Soft Preferences via Bi-level Optimization: Aware-BiL

For the learner models in Section 3, the optimal learner-aware teaching problem can be naturally formalized as the following bi-level optimization problem:

maxπ𝖳 R(π𝖫)s.t.π𝖫argmaxπIRL(π,μ(π𝖳)), (3)

where IRL(π,μ(π𝖳)) stands for the IRL problem solved by the learner given demonstrations from π𝖳 and can include preferences of the learner (see Eq. 1 in Section 3).

There are many possibilities for solving this bi-level optimization problem—see for example [Sinha et al., 2018] for an overview. In this paper we adopted a single-level reduction approach to simplify the above bi-level optimization problem as this results in particularly intuitive optimiziation problems for the teacher. The basic idea of single-level reduction is to replace the lower-level problem, i.e., argmaxπIRL(π,μ(π𝖳)), by the optimality conditions for that problem given by the Karush-Kuhn-Tucker conditions [Boyd and Vandenberghe, 2004, Sinha et al., 2018]. For the learner model outlined in Section 3.2, these reductions take the following form (see Appendix F in the supplementary material for details):

max𝝀:={𝜶lowdr,𝜶updr,𝜷dc} 𝐰r*,μr(π𝝀) (4)
s.t. 0𝜶lowCr
0𝜶upCr
{0𝜷Cc AND μc(π𝝀)δchard}  OR {𝜷=Cc AND μc(π𝝀)δchard}

where π𝝀 corresponds to a softmax policy with a reward function R𝝀(s)=𝒘𝝀,ϕ(s) for 𝒘𝝀=[(𝜶low-𝜶up),-𝜷]. Thus, finding optimal demonstrations means optimization over softmax teaching policies while respecting the learner’s preferences. To actually solve the above optimization problem and find good teaching policies, we use an approach inspired by the Frank-Wolfe algorithm [Jaggi, 2013] detailed in Appendix F. We refer to a teacher implementing this approach as Aware-BiL.

5 Learner-Aware Teaching Under Unknown Constraints

In this section, we consider the more realistic and challenging setting in which the teacher 𝖳 does not know the learner 𝖫’s constraint set Ωr𝖫. Without feedback from 𝖫, 𝖳 can generally not do better than the agnostic teacher who simply ignores any constraints. We therefore assume that 𝖳 and 𝖫 interact in rounds as described by Algorithm 5. The two versions of the algorithm we describe in Sections 5.1 and 5.2 are obtained by specifying how 𝖳 adapts the teaching policy in each round.

{algorithm}

[H] Teacher-learner interaction in the adaptive teaching setting {algorithmic}[1] \StateInitial teaching policy π𝖳,0 (e.g., optimal policy ignoring any constraints) \Forround i=0,1,2, \StateTeacher provides demonstrations with feature vector μr𝖳,i using policy π𝖳,i \StateLearner upon receiving μr𝖳,i computes a policy π𝖫,i with feature vector μr𝖫,i \StateTeacher observes learner’s feature vector μr𝖫,i and adapts the teaching policy \EndFor

In this section, we assume that 𝖫 is as described in Section 3.1: Given demonstrations Ξ𝖳, 𝖫 finds a policy π𝖫 such that μr(π𝖫) matches the L2-projection of μ^r(Ξ𝖳) onto Ωr𝖫. For the sake of simplifying the presentation and the analysis, we also assume that 𝖫 and 𝖳 can observe the exact feature expectations of their respective policies, e.g., μ^r(Ξ𝖳)=μr(π𝖳) if Ξ𝖳 is sampled from π𝖳.

5.1 An Adaptive Learner-aware Teacher Using Volume Search: AdAware-Vol

In our first adaptive teaching algorithm AdAware-Vol, 𝖳 maintains an estimate Ω^r𝖫Ωr𝖫 of the learner’s constraint set, which in each round gets updated by intersecting the current version with a certain affine halfspace, thus reducing the volume of Ω^r𝖫. The new teaching policy is then any policy π𝖳,i+1 which is optimal under the constraint that μ𝖳,i+1Ω^r𝖫. The interaction ends as soon as μr𝖫,i-μr𝖳,i2ϵ for a threshold ϵ. Details on the algorithm are provided in Appendix B.1.

Theorem 2.

Upon termination of AdAware-Vol, L’s output policy πL satisfies R(πL)R(πaware)-ϵ for any policy πaware which is optimal under L’s constraints. For the special case that ΩrL is a polytope defined by m linear inequalities, the algorithm terminates in O(mdr) iterations.

5.2 An Adaptive Learner-aware Teacher Using Line Search: AdAware-Lin

In our second adaptive teaching algorithm, AdAware-Lin, 𝖳 adapts the teaching policy by performing a binary search on a line segment of the form {μr𝖫,i+α𝐰r*|α[αmin,αmax]}dr to find a vector μr𝖳,i+1=μr𝖫,i+αi𝐰r* that is the vector of feature expectations of a policy; here αmax>αmin>0 are fixed constants. If that is not successful, the teacher finds a teaching policy with μr𝖳,i+1argminμrΩrμr-μr𝖫,i-αmin𝐰r*2. The following theorem analyzes the convergence of 𝖫’s performance to R¯𝖫:=maxμrΩrR(μr) under the assumption that 𝖳’s search succeeds in every round. Further details and the proof of the theorem are provided in Appendix B.2.

Theorem 3.

Fix some ε>0 and assume that there exists a constant αmin>0 such that, as long as R¯L-R(μrL,i)>ε, the teacher can find a teaching policy πT,i+1 satisfying μrT,i+1=μrL,i+αiwr* for some αiαmin. Then the learner’s performance increases monotonically in each round of AdAware-Lin, i.e., R(μrL,i+1)>R(μrL,i). Moreover, after at most O(D2εαminlogDε) teaching steps, the learner’s performance satisfies R(μrL,i)>R¯L-2ε. Here we abbreviate D:=diamΩr.

6 Experimental Evaluation

In this section we evaluate our teaching algorithms for different types of learners on the environment introduced in Figure 1. The environment we consider here has three reward objects, i.e., a “star" object with reward of 1.0, a “plus" object with reward of 0.9 and a “dot" object with reward of 0.2 such that dr=3. Two objects of each type are placed randomly on the grid. Furthermore, there are two types of distractors: (i) two “green" distractors are randomly placed at a distance of 0-cell and 1-cell to the “star" objects; (ii) two “yellow" distractors are randomly placed at a distance of 1-cell and 2-cells to the “plus" objects, see Figure 1(a). We have a total of 6 preference features ϕc(s) with dc=6 as follows: the first three features in ϕc(s) are binary-indicators whether there is a “green" distractor at a distance of 0-cell, 1-cell, and 2-cells; similarly the next three features are binary-indicators for the “yellow" distractor. We use a discount factor of γ=0.99. Upon collecting an object, there is a 0.1 probability of transiting to a terminal state.

We consider the learner with soft constraints from Section 3.2 for Cr=5, Cc=10 and δchard=0. We have a total of 5 different learners depending on the preference features used by them out of 6 total preference features discussed above, see Figure 1(a), e.g., L1 learner has no preference features, L2 learner has first two preference features (i.e., ϕc(s)[1],ϕc(s)[2]), L3 learner has first four preference features, L4 learner has first five preference features, and L5 learner has all six preference features. The first row in Figure 2 shows the considered object-worlds and indicates the preference of the learners to avoid certain regions by the gray area.

Table 1: Learners’ average rewards after teaching. L1, , L5 correspond to learners with preferences similar to those in Figure 2. Results are averaged over 10 random object-worlds, ± standard error
Learner (Cc=10,Cr=5)
L1 L2 L3 L4 L5
Teacher Agnostic 7.98±0.06 0.01±0.00 0.01±0.00 0.01±0.00 0.00±0.00
Aware-BiL 7.95±0.08 6.64±0.53 3.75±0.94 2.59±0.76 1.30±0.21
(a) Environments and learners’ preferences for 5 different learners L1, , L5
(b) Learners’ rewards inferred from learner-agnostic teacher’s (Agnostic) demonstrations
(c) Learners’ rewards inferred from learner-aware teacher’s (Aware-BiL) demonstrations
Figure 2: Teaching in object-world environments under full knowledge of the learner’s preferences. Green and yellow cells indicate distractors associated with either “star" or “plus" objects, respectively. Learner’s preferences to avoid cells are indicated in gray. Learner model from Section 3.2 with Cc=10,Cr=5, δchard=0 is considered for these experiments. The learner-aware teacher enable the learner to infer reward functions that are compatible with the learner’s preferences and achieve higher average rewards. In Figure 1(b) and Figure 1(c), blue color represents positive reward, red color represents negative reward, and the magnitude of the reward is indicated by color intensity.

6.1 Teaching under known constraints

Our first set of results are presented in Figure 2. The second and third row show the reward function inferred by the learner for demonstrations provided by a learner-agnostic teacher (Agnostic) and the bi-level learner-aware teacher (Aware-BiL), respectively. We observe that Agnostic fails to teach the learner about objects’ positive rewards in cases where the learners’ preferences conflict with the position of the most rewarding objects (second row). In contrast, Aware-BiL always successfully teaches the learners about rewarding objects that are compatible with the learners’ preferences (third row).

We also compare Agnostic and Aware-BiL in terms of reward achieved by the learner after teaching for object worlds of size 10×10 in Table 1. The numbers show the average reward over 10 randomly generated object-worlds. We observe, that a learner can learn better policies from a teacher that knows about the learner’s preferences and takes them into account.

6.2 Teaching under unknown constraints

In this section we evaluate the teaching algorithms from Section 5. We consider the learner model from Section 3.1 that uses L2-projection to match reward feature expectations as studied in Section 5. Here, we study the learner who considers first two preference features (i.e., ϕc(s)[1],ϕc(s)[2]): this corresponds to an object-world similar to that in Figure 2 (learner L2 with second grid-world). For modeling the hard constraints, we consider box-type linear constraints with δchard[1]=δchard[2]=2.5 for these two preference features (also see Eq. 2).

In this context it is instructive to investigate how quickly these adaptive teaching strategies converge to the performance of a teacher who would have full knowledge about the learner. Results comparing the adaptive teaching strategies (AdAware-Vol and AdAware-Lin) are shown in Figure 3. We can observe that both teaching strategies converge to the best possible performance under full knowledge about the learner (Aware-CMDP).

We also provide results showing the performance achieved by the adaptive teaching strategies on object-worlds of varying sizes, see Figure 3. Note that the performance of AdAware-Vol decreases slightly when teaching for more rounds, i.e., comparing the results after 3 teaching rounds and at the end of the teaching process. This is because of approximations when learner is computing the policy via projection, which in turn leads to errors on the teacher side when approximating Ω^r𝖫. In contrast, AdAware-Lin performance always increases when teaching for more rounds.

(a) Reward over teaching rounds
   10×10 15×15 20×20 Aware-CMDP 7.51 7.25 7.18 Agnostic 3.22 3.18 3.17 Conserv 1.36 1.61 1.56 AdAware-Vol (3rd) 7.39 7.13 6.92 AdAware-Vol (end) 6.9 6.66 6.37 AdAware-Lin (3rd) 6.19 6.03 6.25 AdAware-Lin (end) 7.37 7.27 7.06 (b) Varying grid-size
Figure 3: Performance of adaptive teaching strategies AdAware-Vol and AdAware-Lin. (left) Figure 2(a) shows the reward for learner’s policy over number of teaching interactions. The horizontal lines indicate the performance of learner’s policy for the learner-aware teacher with full knowledge of the learner’s constraints Aware-CMDP, the learner-agnostic teacher Agnostic who ignores any constraints, and a conservative teacher Conserv who considers all 6 constraints. Our adaptive teaching strategies AdAware-Vol and AdAware-Lin significantly outperform baselines (Agnostic and Conserv) and quickly converge towards the optimal performance of Aware-CMDP. The dotted lines AdAware-Vol:T and AdAware-Lin:T show the rewards corresponding to teacher’s policy at a round and are shown to highlight the very different behavior of two adaptive teaching strategies. (right) Table 2(b) shows results for varying grid-size of the environment. Results are reported at i=3rd round and at the “end" round when algorithm reaches it’s stopping criterion. Results are reported as average over 5 runs, where each run corresponds to a random environment.

7 Related Work

Our work is closely related to algorithmic machine teaching [Goldman and Kearns, 1995, Zhu et al., 2018], whose general goal is to design teaching algorithms that optimize the data that is provided to a learning algorithm. Algorithmic teaching provides a rigorous formalism for a number of real-world applications such as personalized education and intelligent tutoring systems [Patil et al., 2014, Zhu, 2015, Rafferty et al., 2016, Hunziker et al., 2018], social robotics [Cakmak and Thomaz, 2014], and human-in-the-loop systems [Singla et al., 2013, Singla et al., 2014].

Most works in machine teaching so far focus on supervised learning tasks and assume that the learning algorithm is fully known to the teacher, see e.g. [Zhu, 2013, Singla et al., 2014, Liu and Zhu, 2016, Mac Aodha et al., 2018]. In the IRL setting, few works study how to provide maximally informative demonstrations to the learner, e.g., [Cakmak and Lopes, 2012, Brown and Niekum, 2019]. In contrast to our work, their teacher fully knows the learner model and provides the demonstrations without any adaptation to the learner. The question of how a teacher should adaptively react to a learner has been addressed by [Liu et al., 2018, Chen et al., 2018, Melo et al., 2018, Yeo et al., 2019], but only in the supervised setting. In recent work, [Kamalaruban et al., 2019] studies interactive teaching algorithms for an IRL learner, however, they consider a sequential learner, and there is no notion of learner’s preferences and constraints in their setting.

Within the area of IRL, there is a line of work on active learning approaches [Cohn et al., 2011, Brown et al., 2018, Amin et al., 2017, Cui and Niekum, 2018], which is related to our work in the sense that they consider the question of how to optimize demonstrations for a given learner. In contrast to us, they take the perspective of the learner who actively influences the demonstrations she receives. A few papers have addressed aspects of the problem that arises in IRL when the learner does not have full access to the reward features, e.g., [Levine et al., 2010] and [Haug et al., 2018].

Our work is also loosely related to multi-agent reinforcement learning. [Dimitrakakis et al., 2017] studies the interaction between agents with misaligned models with a focus on the question of how to jointly optimize a policy. Also [Hadfield-Menell et al., 2016] study the cooperation between agents which do not perfectly understand each other.

8 Conclusions

In the context of inverse reinforcement learning, we investigated the important problem of interacting with learners that have preferences and constraints that prevent them from closely approximating the teacher’s demonstrations. We demonstrated the suboptimality of learner-agnostic teaching and proposed algorithms for learner-aware teaching strategies for known and unknown preferences of the learner. In future work, we will evaluate our approach in machine-human and human-machine tasks and extend our approach to other learner models.

References

  • [Abbeel and Ng, 2004] Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In ICML.
  • [Altman, 1999] Altman, E. (1999). Constrained Markov decision processes, volume 7. CRC Press.
  • [Amin et al., 2017] Amin, K., Jiang, N., and Singh, S. P. (2017). Repeated inverse reinforcement learning. In NIPS, pages 1813–1822.
  • [Boularias et al., 2011] Boularias, A., Kober, J., and Peters, J. (2011). Relative entropy inverse reinforcement learning. In AISTATS, pages 182–189.
  • [Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge university press.
  • [Brown et al., 2018] Brown, D. S., Cui, Y., and Niekum, S. (2018). Risk-Aware Active Inverse Reinforcement Learning. In Conference on Robot Learning, pages 362–372.
  • [Brown and Niekum, 2019] Brown, D. S. and Niekum, S. (2019). Machine teaching for inverse reinforcement learning: Algorithms and applications. In AAAI.
  • [Cakmak and Lopes, 2012] Cakmak, M. and Lopes, M. (2012). Algorithmic and human teaching of sequential decision tasks. In AAAI.
  • [Cakmak and Thomaz, 2014] Cakmak, M. and Thomaz, A. L. (2014). Eliciting good teaching from humans for machine learners. Artificial Intelligence, 217:198–215.
  • [Chen et al., 2018] Chen, Y., Singla, A., Mac Aodha, O., Perona, P., and Yue, Y. (2018). Understanding the role of adaptivity in machine teaching: The case of version space learners. In Advances in Neural Information Processing Systems, NeurIPS’18, pages 1476–1486.
  • [Cohn et al., 2011] Cohn, R., Durfee, E., and Singh, S. (2011). Comparing Action-query Strategies in Semi-autonomous Agents. In AAMAS, pages 1287–1288, Richland, SC.
  • [Cui and Niekum, 2018] Cui, Y. and Niekum, S. (2018). Active reward learning from critiques. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6907–6914. IEEE.
  • [De, 1960] De, G. G. (1960). Les problemes de decisions sequentielles. cahiers du centre d’etudes de recherche operationnelle vol. 2, pp. 161-179.
  • [Dimitrakakis et al., 2017] Dimitrakakis, C., Parkes, D. C., Radanovic, G., and Tylkin, P. (2017). Multi-view decision processes: the helper-ai problem. In Advances in Neural Information Processing Systems, pages 5443–5452.
  • [Dudík et al., 2007] Dudík, M., Phillips, S. J., and Schapire, R. E. (2007). Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8:1217–1260.
  • [Goldman and Kearns, 1995] Goldman, S. A. and Kearns, M. J. (1995). On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31.
  • [Hadfield-Menell et al., 2016] Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. (2016). Cooperative inverse reinforcement learning. In NIPS.
  • [Haug et al., 2018] Haug, L., Tschiatschek, S., and Singla, A. (2018). Teaching inverse reinforcement learners via features and demonstrations. In Advances in Neural Information Processing Systems, NeurIPS’18, pages 8464–8473.
  • [Hunziker et al., 2018] Hunziker, A., Chen, Y., Mac Aodha, O., Gomez-Rodriguez, M., Krause, A., Perona, P., Yue, Y., and Singla, A. (2018). Teaching multiple concepts to a forgetful learner. CoRR, abs/1805.08322.
  • [Jaggi, 2013] Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427–435.
  • [Kamalaruban et al., 2019] Kamalaruban, P., Devidze, R., Cevher, V., and Singla, A. (2019). Interactive teaching algorithms for inverse reinforcement learning. In IJCAI.
  • [Kazama and Tsujii, 2005] Kazama, J. and Tsujii, J. (2005). Maximum entropy models with inequality constraints: A case study on text categorization. Machine Learning, 60(1-3):159–194.
  • [Leibo et al., 2017] Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J., and Graepel, T. (2017). Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473.
  • [Levine et al., 2010] Levine, S., Popovic, Z., and Koltun, V. (2010). Feature construction for inverse reinforcement learning. In NIPS, pages 1342–1350.
  • [Liu and Zhu, 2016] Liu, J. and Zhu, X. (2016). The teaching dimension of linear learners. Journal of Machine Learning Research, 17(162):1–25.
  • [Liu et al., 2018] Liu, W., Dai, B., li, X., Rehg, J. M., and Song, L. (2018). Towards black-box iterative machine teaching. In ICML.
  • [Mac Aodha et al., 2018] Mac Aodha, O., Su, S., Chen, Y., Perona, P., and Yue, Y. (2018). Teaching categories to human learners with visual explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3820–3828.
  • [Melo et al., 2018] Melo, F. S., Guerra, C., and Lopes, M. (2018). Interactive optimal teaching with unknown learners. In IJCAI, pages 2567–2573.
  • [Mendez et al., 2018] Mendez, J. A. M., Shivkumar, S., and Eaton, E. (2018). Lifelong inverse reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS) 2018, pages 4507–4518.
  • [Osa et al., 2018] Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., Peters, J., et al. (2018). An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1-2):1–179.
  • [Patil et al., 2014] Patil, K. R., Zhu, X., Kopeć, Ł., and Love, B. C. (2014). Optimal teaching for limited-capacity human learners. In NIPS, pages 2465–2473.
  • [Rafferty et al., 2016] Rafferty, A. N., Brunskill, E., Griffiths, T. L., and Shafto, P. (2016). Faster teaching via pomdp planning. Cognitive science, 40(6):1290–1332.
  • [Ratliff et al., 2006] Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A. (2006). Maximum margin planning. In ICML, pages 729–736.
  • [Singla et al., 2013] Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2013). On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education.
  • [Singla et al., 2014] Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2014). Near-optimally teaching the crowd to classify. In ICML.
  • [Sinha et al., 2018] Sinha, A., Malo, P., and Deb, K. (2018). A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2):276–295.
  • [Yeo et al., 2019] Yeo, T., Kamalaruban, P., Singla, A., Merchant, A., Asselborn, T., Faucon, L., Dillenbourg, P., and Cevher, V. (2019). Iterative classroom teaching. In AAAI.
  • [Zhou et al., 2018] Zhou, Z., Bloem, M., and Bambos, N. (2018). Infinite time horizon maximum causal entropy inverse reinforcement learning. IEEE Trans. Automat. Contr., 63(9):2787–2802.
  • [Zhu, 2013] Zhu, X. (2013). Machine teaching for bayesian learners in the exponential family. In NIPS, pages 1905–1913.
  • [Zhu, 2015] Zhu, X. (2015). Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pages 4083–4087.
  • [Zhu et al., 2018] Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. (2018). An overview of machine teaching. CoRR, abs/1801.05927.
  • [Ziebart, 2010] Ziebart, B. D. (2010). Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University.
  • [Ziebart et al., 2013] Ziebart, B. D., Bagnell, J. A., and Dey, A. K. (2013). The principle of maximum causal entropy for estimating interacting processes. IEEE Transactions on Information Theory, 59(4):1966–1980.
  • [Ziebart et al., 2008] Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI.

Appendix A List of Appendices

In this section we provide a brief description of the content provided in the appendices of the paper.

  • Appendix B provides additional details on the adaptive teaching strategies (Section 5).

  • Appendix C provides background on the (discounted) MCE-IRL problem (Section 3).

  • Appendix D provides additional details on the (discounted) MCE-IRL problem with preferences (Section 3.2).

  • Appendix E provides the LP formulation for the teacher Aware-CMDP (Section 4.1).

  • Appendix F provides additional details on the bi-level optimization approach for the teacher Aware-BiL (Section 4.2).

Appendix B Details for Learner-Aware Teaching under Unknown Constraints (Section 5)

In this appendix, we provide more details on the adaptive teaching algorithms AdAware-Vol and AdAware-Lin described in Sections 5.1 and 5.2. Recall that both teaching algorithms are obtained from Algorithm 5 by defining the way in which the teacher 𝖳 adapts the teaching policy based on the learner 𝖫’s feature expectations μr𝖫 in past rounds.

B.1 Details for AdAware-Vol (Section 5.1)

Estimation of the learner’s constraint set.

In AdAware-Vol, 𝖳 maintains an estimate Ω^r𝖫,i of 𝖫’s constraint set, starting with Ω^r𝖫,0=Ωr. After observing the feature expectations μr𝖫,i of the policy 𝖫 found in round i, 𝖳 updates this estimate as follows:

Ω^r𝖫,i+1:=Ω^r𝖫,i{μr𝖫,i+νdr|μr𝖳,i-μr𝖫,i,ν0} (5)

The set on the right hand side of (5) with which Ωr𝖫,i gets intersected is a halfspace containing Ωr𝖫. This is due to the fact that Ωr𝖫 is convex by assumption, and to our assumption that 𝖫’s learning algorithm is such that it outputs a policy whose feature expectations μr𝖫,i match the L2-projection of μr𝖳,i to Ωr𝖫. Inductively, it follows that Ω^r𝖫,iΩr𝖫 for all i.

In practice, we implement a slightly modified version of the update step in which we intersect Ω^r𝖫,i with a halfspace that is shifted in the direction of μr𝖳,i-μr𝖫,i by a small amount, i.e., we use

{μr𝖫,i+(1-η)(μr𝖳,i-μr𝖫,i)+νdr|μr𝖳,i-μr𝖫,i,ν0}

with a step size parameter η(0,1). This helps make the algorithm more robust to noise in the learner’s feature expectations. In our experiments, we used η=0.9.

Update of the teaching policy.

After updating the estimate of the learner’s constraint set to Ω^r𝖫,i, 𝖳 solves a constrained MDP in order to find

π𝖳,i+1argmaxπ,μr(π)Ω^r𝖫,iR(π).

Given that Ω^r𝖫,i is cut out by linear equations, solving the constrained MDP reduces to solving an LP, as described in Appendix E.

Termination of the interaction.

The algorithm terminates as soon as the stopping criterion μr𝖫,i-μr𝖳,i2ϵ is satisfied. Note that Ω^r𝖫,iΩr𝖫 implies that

R(π𝖳,i)R(πaware)

for any πawareargmaxπ,μr(π)Ωr𝖫R(π). Therefore, after termination we have

R(π𝖫,i)R(πaware)-ϵ

for any policy πaware which is optimal under 𝖫’s constraints, which is the first statement of Theorem 2.

The second statement of Theorem 2 follows from the fact that if Ωr𝖫 is a convex polytope cut out by m linear inequalities, the number of faces, which is in O(mdr), is an upper bound on the number of iterations of the algorithm, because one face is “eliminated” in each round.

B.2 Details for AdAware-Lin (Section 5.2)

{algorithm} [H] {algorithmic}[1] \Requireμr𝖫, αmin, αmax, εα, εμ. \Stateαuαmax, αlαmin \Whileαu-αl>εα \Stateα(αu+αl)/2 \Stateπ𝖳IRL(μr𝖫+α𝐰r*) \Ifμr(π𝖳)-μr𝖫-α𝐰r*2>εμ \Stateαuα \Else\Stateαlα \EndIf \EndWhile \Ifμr(π𝖳)-μr𝖫-α𝐰r*2>εμ \Stateπ𝖳IRL(μr𝖫+αmin𝐰r*) \EndIf \State\Returnπ𝖳
Figure 4: LineSearch
Figure 5:
LineSearch is the algorithm that 𝖳 uses in order to find a teaching policy π𝖳 provided that the feature expectations of 𝖫’s current policy are μr𝖫. Figure 5 illustrates the two cases may occur: For the right μr𝖫, LineSearch returns a policy π𝖳 whose feature expectations satisfy μr𝖳=μr𝖫+α*𝐰r* such that α*>αmin. For the left μr𝖫, LineSearch returns a policy π𝖳 whose feature expectations satisfy μr𝖳argminμrΩrμr-μr𝖫+αminμr𝖳.

In AdAware-Lin, 𝖳 updates the teaching policy π𝖳,i+1 based on 𝖫’s feature expectations μr𝖫,i from the previous round. To do so, 𝖳 uses LineSearch (Algorithm B.2) to perform a binary search on the line segment

{μr𝖫,i+α𝐰r*|α[αmin,αmax]}dr (6)

in order to find a vector μr that is realizable as the vector of feature expectations of a policy. If the intersection of the line segment (6) with Ωr is non-empty, it is of the form {μr𝖫+α𝐰r*|α[αmin,α*]} for some α*αmax due to the convexity of Ωr. In that case, LineSearch returns a policy with feature expectations

μr𝖳,i+1=μr𝖫,i+αi*𝐰r*,

where αi* is the maximal α[αmin,αmax] such that μr𝖫,i+α𝐰r*Ωr. If the intersection is empty, LineSearch returns a policy with feature expectations

μr𝖳,i+1argminμrΩrμr-μr𝖫,i-αmin𝐰r*2.

Figure 5 illustrates the two cases that may occur.

B.2.1 Proof of Theorem 3

In this section, we provide the proof of Theorem 3, which gives a guarantee on the improvement of 𝖫’s performance in each round of the AdAware-Lin algorithm. The assumption we make here is that, in every teaching round, LineSearch returns a teaching policy π𝖳,i+1 such that μr𝖳,i+1=μr𝖫,i+αi𝐰r* for some αiαmin, where αmin>0 is a fixed constant. It is easy to see that this assumption, together with our assumption on 𝖫’s algorithm and the convexity of Ωr𝖫, imply that the change in learner performance

ΔRi:=R(μr𝖫,i+1)-R(μr𝖫,i)

is non-negative in every teaching round. The following proposition, which will be needed in the proof of Theorem 3, strengthens this statement:

Proposition 1.

Let R¯L:=maxμrΩrLR(μr) be the maximally achievable learner performance. Assume that, in teaching round i, T can find a teaching policy πT,i+1 whose feature expectations satisfy μrT,i+1=μrL,i+αiwr* for some αi>0. Then

R¯𝖫-R(μr𝖫,i)ΔRi+DΔRiαi-ΔRi, (7)

where D=diamΩr.

Proof of Proposition 1.

Consider the plane Vdr spanned by μr𝖫,i,μr𝖳,i+1 and μr𝖫,i+1 and denote by μ~r the unique point in V with the properties that

  1. (a)

    𝐰r*,μ~r=𝐰r*,μr𝖫,i+1,

  2. (b)

    μ~r lies on the same side of the line through μ𝖫,i and μ𝖳,i+1 as μr𝖫,i+1, and

  3. (c)

    μ~r,μr𝖳,i+1 and μr𝖫,i span a right triangle with μ~r at the right-angled corner.

Note that μr𝖫,i+1 must lie inside this triangle, i.e., on the red line segment in Figure 6: Otherwise there would a point on the line segment connecting μr𝖫,i+1 and μr𝖫,i, and hence in Ωr𝖫 by convexity, which is closer to μr𝖳,i+1 than μr𝖫,i+1, contradicting the fact that μr𝖫,i+1 is closest to μr𝖳,i+1 among all points in Ωr𝖫. Denote by ~ the line passing through μ~r and μr𝖫,i.

Figure 6: Illustration of the proof of Proposition 1: The smaller the performance increase ΔRi, the better the upper bound on the gap R¯Ω-R(μr𝖫,i).

The facts that Ωr𝖫 is convex and that μr𝖫,i+1=argminμrΩr𝖫μr𝖳,i+1-μr2 imply that Ωr𝖫 must lie on one side of the hyperplane

μr𝖫,i+1+(μr𝖳,i+1-μr𝖫,i+1)dr.

Therefore, we can upper bound R¯𝖫 in terms of the slope s of the line which arises by intersecting that hyperplane with V:

R¯𝖫R(μr𝖫,i+1)+Ds=R(μr𝖫,i)+ΔRi+Ds. (8)

Note that the slope s is upper bounded by the slope s~ of ~. We have s~=ΔRih, where h is the length of the red line segment in Figure 6, and h=(αi-ΔRi)ΔRi by Pythagoras’s theorem. Using that, we obtain

ss~=ΔRiαi-ΔRi. (9)

The claimed estimate (7) follows by plugging this upper bound for s into (8) and rearranging. ∎

Proof of Theorem 3.
Proof of Theorem 3.

The fact that R(μr𝖫,i+1)>R(μr𝖫,i), which is equivalent to ΔRi>0, follows immediately from Proposition 1.

We now prove the claimed rate of convergence.

First, using Proposition 1, we note that the assumption that R¯𝖫-R(μr𝖫,i)>ε implies that

ε<ΔRi+DΔRiαi-ΔRi. (10)

Using that, we can conclude that

ΔRi>min{ε/2,εαmin/(4D2+ε2)}. (11)

Indeed, if ΔRiε2, it follows from (10) that we must have DΔRi/(αmin-ΔRi)>ε2, which implies ΔRi>εαmin/(4D2+ε2). Since we are interested in the behavior as ε0, we assume from now on that ε is so small that εαmin/(4D2+ε2)<ε/2, so that (11) becomes

ΔRi>εαmin/(4D2+ε2)=:C0. (12)

Second, we observe that

αi-ΔRi>αmin2=:C1 (13)

except in at most N:=2αmin(maxR|Ω-minR|Ω) teaching steps. To see that, note that if the claimed inequality, which is equivalent to αi-αmin2>ΔRi, does not hold, performance increases by at least ΔRiαmin2 as αi>αmin, and that can happen at most N times.

The inequalities (12) and (13) together imply that we have

C0C1(αi-ΔRi)ΔRi (14)

as long as R¯𝖫-R(μr𝖫,i)>ε, except in at most N teaching steps. Setting C:=1C0C1, this is equivalent to

ΔRiαi-ΔRiCΔRi (15)

Plugging (15) into the bound (7) provided by Proposition 1, we obtain the estimate

11+CD(R¯𝖫-R(μr𝖫,i))ΔRi. (16)

We have C=1εαmin2(4D2+ε2), and hence

11+CD=εαminεαmin+2(4D2+ε2)D11+10εαminD2=:λ (17)

If we had the estimates (16), (17) for all teaching steps, we could conclude that the learner performance satisfies R(μr𝖫,i)>R¯𝖫-2ε after at most O(D2εαminlogDε) teaching steps. One can see that e.g. by comparing the sequence R0,R1,R2, with the solution R(t) of the ordinary differential equation R˙=λ(R¯𝖫-R), which satifies R¯𝖫-R(t)=(R¯𝖫-R(0))exp(-λt). Since the number N of teaching steps for which (16), (17) do potentially not hold is O(Dαmin), we can still make this conclusion. ∎

Appendix C Background on (discounted) MCE-IRL Problem (Section 3)

Our learner models build on the (discounted) Maximum Causal Entropy (MCE) IRL framework [Ziebart et al., 2008, Ziebart, 2010, Ziebart et al., 2013, Zhou et al., 2018]. The results below are based on the MDCE-IRL formulation from [Zhou et al., 2018].

C.1 Primal problem

In the standard (discounted) MCE-IRL framework, a learning agent aims to identify a policy that matches the feature expectations of the teacher’s demonstrations while simultaneously maximizing the (discounted) causal entropy of the policy, i.e., the learner solves the following optimization problem:

maxπ Hγ(A0:S0:):=t=0γt𝔼[-logπ(atst)]
subject to μr(π)[i]=μ^r(Ξ𝖳)[i]i{1,2,,dr}.

Here, μr(π)[i] and μ^r(Ξ𝖳)[i] denote the scalar values of the ith reward feature. The idea is that without any further information beyond the teacher’s demonstrations, the most uncertain solution matching the reward feature expectation of those demonstrations should be preferred.

Formulating this as a minimization problem and spelling out all the constraints, we arrive at the following primal:

min𝝅={πt}t=0-Hγ(A0:S0:)
subject to
μr(πt)[i]=μ^r(Ξ𝖳)[i]i{1,2,,dr}
πt(a|s)0a𝒜,s𝒮,t0
a𝒜πt(a|s)=1s𝒮,t0
πt(a|s)=πt(a|s)a𝒜,s𝒮,t0,t0

The last condition ensures that the policy π is stationary.

C.2 Lagrangian relaxation

The Lagrangian relaxation optimization formulation of the above primal problem is given by

(𝝅,𝝀,𝝍) =-Hγ(A0:S0:)+𝝀(μ^r(Ξ𝖳)-μr(πt))+s,tψs,t(1-a𝒜πt(a|s))
subject to
πt(a|s)0a𝒜,s𝒮,t0
πt(a|s)=πt(a|s)a𝒜,s𝒮,t,t0

Here, 𝝀dr and 𝝍={ψs,t}st. Also, is the transpose operator defined for vectors.

Remark. The Lagrangian relaxation of the optimization problem is not convex in the problem variables because of the term 𝝀(μ^r(Ξ𝖳)-μr(πt)) in the objective function, which is not convex in the variables πt. However, it can be shown that strong duality holds for both its dual and primal formulations ([Zhou et al., 2018]). The dual formulation is described in Section C.4.

C.3 Parametric form of the policy

For a given 𝝀, the optimal policy π𝝀soft(a|s) is given by

π𝝀soft(a|s) =exp(Q𝝀soft(s,a))exp(V𝝀soft(s))

where the quantities are defined recursively as follows:

Q𝝀soft(s,a) =𝝀μr(π𝝀soft(a|s))+γs𝒮T(s|s,a)V𝝀soft(s)
V𝝀soft(s) = log a𝒜exp(Q𝝀soft(s,a))

This is shown by taking the derivative of the Lagrangian, (𝝅,𝝀,𝝍) w.r.t. the primal variables πt and equating it to 0, i.e.,

L({πt}t=0,𝝀,𝝍)πt =0.

For a given 𝝀, the corresponding softmax policy can be obtained by Soft-Value-Iteration procedure (see [Ziebart, 2010, Algorithm. 9.1], [Zhou et al., 2018]).

C.4 Dual problem

For any given 𝝀,𝝍, let g(𝝀,𝝍) be the optimal value for the optimization problem defined by the Lagrangian relaxation problem in Section C.2. As strong duality holds for the (discounted) MCE-IRL problem and its dual counter part, we solve only the following concave dual problem:

maximize𝝀dr,ψs,tg(𝝀,𝝍)

C.5 Gradients for the dual variables

As the dual problem is concave, it can be solved using gradient ascent. The gradients of the dual function described in Section C.4 are given by:

𝝀g =μ^r(Ξ𝖳)-μr(π𝝀soft)
ψs,tg =1-a𝒜π𝝀soft(a|s)

Here π𝝀soft is the parametric softmax policy described above. The second condition is automatically satisfied because π𝝀soft is a probability distribution.

The gradient update rule to compute the optimal 𝝀 is:

𝝀next𝝀-η(μr(π𝝀soft)-μ^r(Ξ𝖳))

where η is the learning rate.

Appendix D Details of (discounted) MCE-IRL Problem with Preferences (Section 3.2)

Here we present the background of the learner model described in Section 3.2. In this setting, the learner’s preferences are modeled as linear soft constraints with L1 penalties. We consider the minimization variant of the problem. The results in this section follow directly from the analysis of Maximum Entropy Models under different constraints, as presented in [Kazama and Tsujii, 2005, Dudík et al., 2007] when applied to (discounted) MCE-IRL problem [Ziebart et al., 2013, Zhou et al., 2018]. For brevity, redundant details of the derivations are omitted.
The final policy of the learner is given by π𝝀soft and is defined in Section D.3.

D.1 Primal problem

The primal problem is given by

min𝝅={πt}t=0;δrsoft,low,δrsoft,up,δcsoft,up0-Hγ(A0:||S0:)+i=1drCr(δrsoft,low[i]+δrsoft,up[i])+j=1dcCcδcsoft,up[j]
subject to
  μ^r(Ξ𝖳)[i]-μr(πt)[i]δrsoft,low[i]i{1,2,,dr}
  μr(πt)[i]-μ^r(Ξ𝖳)[i]δrsoft,up[i]i{1,2,,dr}
  μc(πt)[j]δchard[j]+δcsoft,up[j]j{1,2,,dc}

Here we have δrsoft,low,δrsoft,updr and δcsoft,updc as the primal optimization slack variables with the constraint that δrsoft,low,δrsoft,up,δcsoft,up0. We also have Cr>0,Cc>0. δcharddc is a given constant vector.

Remark. low and up in the superscripts of dual variables represent whether they are variables for lower bound constraints or upper bound constraints.

D.2 Lagrangian relaxation

The Lagrangian relaxation optimization formulation of the primal problem described in Section D.1 is given by

(𝝅,δrsoft,low,δrsoft,up,δcsoft,up,𝝀,𝝍) =-Hγ(A0:,S0:)+(𝜶low-𝜶up)(μ^r(Ξ𝖳)-μr(πt))
+𝜷μc(πt)
+s,tψs,t(1-a𝒜πt(a|s))-(𝜶low)δrsoft,low-(𝜶up)δrsoft,up
-𝜷δcsoft,up-𝜷δchard
-(𝝆low)δrsoft,low-(𝝆up)δrsoft,up
-𝝈δcsoft,up
+i=1drCr(δrsoft,low[i]+δrsoft,up[i])+j=1dcCcδcsoft,up[j]
subject to
πt(a|s)0a𝒜,s𝒮,t0
πt(a|s)=πt(a|s)a𝒜,s𝒮,t,t0

Here, 𝜶low,𝜶up,𝝆low,𝝆updr, and 𝜷,𝝈dc. We also have non-negativity constraints on the dual variables: 𝜶low,𝜶up,𝜷,𝝆low,𝝆up,𝝈0. A few additional notes:

  • For convenience, we will denote the group of dual variables as 𝝀:={𝜶low,𝜶up,𝜷,𝝆low,𝝆up,𝝈}

  • The reward parameter 𝒘𝝀=[(𝜶low-𝜶up),-𝜷] is used to define the learner’s reward function R𝝀(s)=𝒘𝝀,ϕ(s).

  • is the transpose operator, defined for vectors.

D.3 Parametric form of the policy

For a given, 𝝀:={𝜶low,𝜶up,𝜷,𝝆low,𝝆up,𝝈}, the optimal policy π𝝀soft(a|s) is given by

π𝝀soft(a|s) =exp(Q𝝀soft(s,a))exp(V𝝀soft(s))

where the quantities are defined recursively as follows:

Q𝝀soft(s,a) =(𝜶low-𝜶up)μr(π𝝀soft(a|s))-𝜷μc(π𝝀soft(a|s))+γs𝒮T(s|s,a)V𝝀soft(s)
V𝝀soft(s) = log (a𝒜exp(Q𝝀soft(s,a)))

This is shown by taking the derivative of the Lagrangian, (𝝅,𝝀,𝝍) w.r.t the primal variables, πt and equating it to 0. i.e.

L({πt}t=0,𝝀,𝝍)πt =0

D.4 Updated Lagrangian

We find the partial derivatives of the Lagrangian defined in Section D.2 w.r.t all the primal variables, δrsoft,low,δrsoft,up,δcsoft,up:

δrsoft,low[i] =0
αlow[i] =Cr-ρlow[i]
Also, δrsoft,up =0
αup[i] =Cr-ρup[i]
And, δcsoft,up =0
β[i] =Cr-σ[i]

The dual variables satisfy 𝝈,𝝆low,𝝆up0. Hence, the above conditions translate into the following constraints on the set of dual variables, 𝜶low,𝜶up,𝜷:

0αlow[i]Cri{1,2,,dr}
0αup[i]Cri{1,2,,dr}
0β[j]Ccj{1,2,,dc}

The updated Lagrangian now has these additional constraints and is given by:

(𝝅,δrsoft,low,δrsoft,up,δcsoft,up,𝝀,𝝍) =-Hγ(A0:,S0:)+(𝜶low-𝜶up)(μ^r(Ξ𝖳)-μr(πt))+𝜷μc(πt)
+s,tψs,t(1-a𝒜πt(a|s))-(𝜶low)δrsoft,low-(𝜶up)δrsoft,up
-𝜷δcsoft,up-𝜷δchard
-(𝝆low)δrsoft,low-(𝝆up)δrsoft,up
-𝝈δcsoft,up
+i=1drCr(δrsoft,low[i]+δrsoft,up[i])+j=1dcCcδcsoft,up[j]
subject to
πt(a|s)0a𝒜,s𝒮,t0
πt(a|s)=πt(a|s)a𝒜,s𝒮,t,t0
0αlow[i]Cri{1,2,,dr}
0αup[i]Cri{1,2,,dr}
0β[j]Ccj{1,2,,dc}

The set of dual variables becomes 𝝀:={𝜶low,𝜶up,𝜷} and 𝝍={ψs,t}st.

D.5 Dual problem

For any given 𝝀,𝝍, let g(𝝀,𝝍) be the optimal value for the Lagrangian relaxation problem. Strong Duality holds for both our primal and dual formulations, and the dual optimal policy is also optimal for the primal formulation. Hence, we solve the concave dual problem, given by

maximize𝜶low,𝜶updr,𝜷dc,ψs,t g(𝝀,𝝍)
subject to
0𝜶lowCr
0𝜶upCr
0𝜷Cc

where 𝝀:={𝜶low,𝜶up,𝜷}.

D.6 Gradients for the dual problem

As the dual problem is concave, it can be solved using gradient ascent.
Note that,

ψs,tg =1-a𝒜π𝝀soft(a|s)

Here π𝝀soft is the parametric softmax policy described above. This condition is automatically satisfied because π𝝀soft is a probability distribution. For the remaining dual variables, we have the following gradients:

𝜶low g =μ^r(Ξ𝖳)-μr(π𝝀soft)
𝜶upg =μr(π𝝀soft)-μ^r(Ξ𝖳)
𝜷 g =μc(π𝝀soft)

The (projected) gradient update rules to compute the optimal value of the dual variables (𝜶low,𝜶up,𝜷) are given by the following:

𝜶nextlow 𝜶low-η(μr(π𝝀soft)-μ^r(Ξ𝖳))
αnextlow[i] max(0,αnextlow[i])i{1,2,,dr}
αnextlow[i] min(Cr,αnextlow[i])i{1,2,,dr}
𝜶nextup 𝜶up-η(μ^r(Ξ𝖳)-μr(π𝝀soft))
αnextup[i] max(0,αnextup[i])i{1,2,,dr}
αnextup[i] min(Cr,αnextup[i])i{1,2,,dr}
𝜷next 𝜷-η(-μc(π𝝀soft))
βnext[j] max(0,βnext[j])j{1,2,,dc}
βnext[j] min(Cc,βnext[j])j{1,2,,dc}

where η is the learning rate.

Appendix E LP Formulation for the Teacher Aware-CMDP (Section 4.1)

The problem of finding optimal learner-aware teaching demonstrations for the learner in Section 3.1 with linear preferences can be formulated as the following linear program (based on the linear programming formulation for solving MDPs [De, 1960]):

maxz saz(s,a)𝐰r*,ϕr(s) (18)
s.t. az(s,a)=(1-γ)P0(s)+γsaT(s|s,a)z(s,a)s (19)
z(s,a)0s,a (20)
saz(s,a)ϕc(s)[j]δchard[j]j{1,2,,dc} (21)

Here z is a vector of discounted state-action frequencies and z(s,a) refers to state-action frequency for state s and action a. The constraints in (21) are the linear preference constraints. From the optimal solution of the LP, an optimal stochastic policy can be extracted by

π(s,a):=z(s,a)az(s,a). (22)

Appendix F Bi-Level Optimization Approach (Section 4.2)

We only show the formalism for the most general bi-level problem for learners with linear preferences.

F.1 Using Dual (discounted) MCE-IRL formulation for the learner model in Section 3.2

The basic bi-level optimization problem that we aim to solve is the following:

maxπ𝖳 R(π𝖫)
subject to π𝖫argmaxπIRL(π,μ(π𝖳)).

We will replace the lower-level problem, i.e., argmaxπIRL(π,μ(π𝖳)) with its Karush-Kuhn-Tucker conditions [Boyd and Vandenberghe, 2004, Sinha et al., 2018]. The lower-level problem in its dual formulation is given in Appendix D.5.

Omitting details and replacing R(π𝝀):=𝐰r*,μr(π𝝀), this yields problems of the following form:

max𝝀 𝐰r*,μr(π𝝀)
subject to:
0𝜶lowCr
0𝜶upCr
0𝜷Cc
μc(π𝝀)()δchard

where 𝝀:={𝜶low,𝜶up,𝜷}. Here π𝝀 corresponds to a softmax policy with a reward function R𝝀(s)=𝒘𝝀,ϕ(s) for 𝒘𝝀=[(𝜶low-𝜶up),-𝜷]. Thus, finding optimal demonstrations means optimization over softmax teaching policies while respecting the learner’s preferences.

F.1.1 Optimal solution

The cases of the above problem we can observe have to be solved separately and the best solution must be picked. That is, we find the following two solutions: (step i) 𝝀1*, and (step ii) 𝝀2*. Then pick the best 𝝀* in (step iii):

Step i: λ1* Compute optimal parameters 𝝀1* by solving the following problem:

maxλ 𝐰r*,μr(π𝝀)
subject to:
0𝜶lowCr
0𝜶upCr
0𝜷Cc
μc(π𝝀)δchard


Step ii: λ2* Compute optimal parameters 𝝀2* by solving the following problem:

max𝝀 𝐰r*,μr(π𝝀) (23)
subject to: (24)
0𝜶lowCr (25)
0𝜶upCr (26)
𝜷=Cc (27)
μc(π𝝀)δchard (28)

Step iii: λ* Pick the best solution as

𝝀*=argmax𝝀{𝝀1*,𝝀2*}𝐰r*,μr(π𝝀)

This provides the optimal policy for the teacher. The teacher then computes feature expectation of this policy and provide it to the learner.

F.2 Solving the above problem

We adopt a variant of the Frank-Wolfe algorithm [Jaggi, 2013] to solve the problems of the form:

max𝝀 R(π𝝀):=𝐰r*,μr(π𝝀) (29)
subject to: (30)
0𝜶lowCr (31)
0𝜶upCr (32)
0𝜷Cc (33)
μc(π𝝀)()δchard (34)

In particular, we take the following steps to optimize the teaching policy π𝝀:

  1. 1.

    Initialization. Find a feasible starting point 𝝀0

  2. 2.

    Optimization. For t=1,2,

    • Compute the gradient 𝒈t=[𝝀R(π𝝀)](𝝀t-1) of the objective at 𝝀t-1. In experiments we approximate the gradient using finite-differences.

    • Linearize the constraints μc(π𝝀)()δchard at 𝝀t-1 as 𝒃t+𝑨t(𝝀-𝝀t-1)()δchard, where 𝒃t=μc(π𝝀t-1) and 𝑨t=[𝝀μc(π𝝀)](𝝀t-1). Again, we employ finite-differences to approximate this linearization. Clearly, we can reuse computation from the gradient estimation of the objective here to reduce computational demands.

    • Solve the direction-finding subproblem (a linear problem):

      max𝜸 𝜸,𝒈t
      subject to:
      0𝜶lowCr
      0𝜶upCr
      0𝜷Cc
      𝒃t+𝑨t-1(𝝀-𝝀t-1)()δchard

      with optimal solution 𝜸t*. Assuming that the linear approximation of the constraints is accurate locally, the directional vector 𝒅t=𝜸t*-𝝀t-1 is an ascent direction.

    • Perform a line-search from 𝝀t-1 to 𝜸t* and let 𝝀t be the point that maximizes the line search.

    • Upon convergence, terminate the For loop.

Upon convergence of the algorithm, the teacher can use the final 𝝀t for teaching.

Remark. Observe that the above algorithm would reduce to the standard Frank-Wolfe algorithm with line-search in the case of linear inequalities only.