Abstract
Inverse reinforcement learning (IRL) enables an agent to learn complexbehavior by observing demonstrations from a (near)optimal policy. The typicalassumption is that the learner's goal is to match the teacher's demonstratedbehavior. In this paper, we consider the setting where the learner has its ownpreferences that it additionally takes into consideration. These preferencescan for example capture behavioral biases, mismatched worldviews, or physicalconstraints. We study two teaching approaches: learneragnostic teaching, wherethe teacher provides demonstrations from an optimal policy ignoring thelearner's preferences, and learneraware teaching, where the teacher accountsfor the learner's preferences. We design learneraware teaching algorithms andshow that significant performance improvements can be achieved overlearneragnostic teaching.
Quick Read (beta)
Learneraware Teaching: Inverse Reinforcement Learning with Preferences and Constraints
Abstract
Inverse reinforcement learning (IRL) enables an agent to learn complex behavior by observing demonstrations from a (near)optimal policy. The typical assumption is that the learner’s goal is to match the teacher’s demonstrated behavior. In this paper, we consider the setting where the learner has its own preferences that it additionally takes into consideration. These preferences can for example capture behavioral biases, mismatched worldviews, or physical constraints. We study two teaching approaches: learneragnostic teaching, where the teacher provides demonstrations from an optimal policy ignoring the learner’s preferences, and learneraware teaching, where the teacher accounts for the learner’s preferences. We design learneraware teaching algorithms and show that significant performance improvements can be achieved over learneragnostic teaching.
Learneraware Teaching: Inverse Reinforcement Learning with Preferences and Constraints
Sebastian Tschiatschek^{†}^{†}thanks: Authors contributed equally to this work. Microsoft Research [email protected] Ahana Ghosh^{1}^{1}footnotemark: 1 MPISWS [email protected] Luis Haug^{1}^{1}footnotemark: 1 ETH Zurich [email protected] Rati Devidze MPISWS [email protected] Adish Singla MPISWS [email protected]
noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]
1 Introduction
Inverse reinforcement learning (IRL) enables a learning agent (learner) to acquire skills from observations of a teacher’s demonstrations. The learner infers a reward function explaining the demonstrated behavior and optimizes its own behavior accordingly. IRL has been studied extensively [abbeel2004apprenticeship, ratliff2006maximum, ziebart2010modeling, boularias2011relative, osa2018algorithmic] under the premise that the learner can and is willing to imitate the teacher’s behavior.
In realworld settings, however, a learner typically does not blindly follow the teacher’s demonstrations, but also has its own preferences and constraints. For instance, consider demonstrating to an autopilot of a selfdriving car how to navigate from A to B by taking the most fuelefficient route. These demonstrations might conflict with the preference of the autopilot to drive on highways in order to ensure maximum safety. Similarly, in robothuman interaction with the goal of teaching people how to cook, a teaching robot might demonstrate to a human user how to cook “roast chicken”, which could conflict with the preferences of the learner who is “vegetarian”. To give yet another example, consider a surgical training simulator which provides virtual demonstrations of expert behavior; a novice learner might not be confident enough to imitate a difficult procedure because of safety concerns. In all these examples, the learner might not be able to acquire useful skills from the teacher’s demonstrations.
In this paper, we formalize the problem of teaching a learner with preferences and constraints. First, we are interested in understanding the suboptimality of learneragnostic teaching, i.e., ignoring the learner’s preferences. Second, we are interested in designing learneraware teachers who account for the learner’s preferences and thus enable more efficient learning. To this end, we study a learner model with preferences and constraints in the context of the Maximum Causal Entropy (MCE) IRL framework [ziebart2010modeling, ziebart2013principle, zhou2018mdce]. This enables us to formulate the teaching problem as an optimization problem, and to derive and analyze algorithms for learneraware teaching. Our main contributions are:
 I

II
We analyze the problem of optimizing demonstrations for the learner when preferences are known to the teacher, and we propose a bilevel optimization approach to the problem (Section 4).

III
We propose strategies for adaptively teaching a learner with preferences unknown to the teacher, and we provide theoretical guarantees under natural assumptions (Section 5).

IV
We empirically show that significant performance improvements can be achieved by learneraware teachers as compared to learneragnostic teachers (Section 6).
2 Problem Setting
Environment. Our environment is described by a Markov decision process (MDP) $\mathcal{M}:=(\mathcal{S},\mathcal{A},T,\gamma ,{P}_{0},R)$. Here $\mathcal{S}$ and $\mathcal{A}$ denote finite sets of states and actions. $T:\mathcal{S}\times \mathcal{S}\times \mathcal{A}\to [0,1]$ describes the state transition dynamics, i.e., $T({s}^{\prime}s,a)$ is the probability of landing in state ${s}^{\prime}$ by taking action $a$ from state $s$. $\gamma \in (0,1)$ is the discounting factor. ${P}_{0}:\mathcal{S}\to [0,1]$ is an initial distribution over states. $R:\mathcal{S}\to \mathbb{R}$ is the reward function. We assume that there exists a feature map ${\varphi}_{r}:\mathcal{S}\to {[0,1]}^{{d}_{r}}$ such that the reward function is linear, i.e., $R(s)=\u27e8{\mathbf{w}}_{r}^{*},{\varphi}_{r}(s)\u27e9$ for some ${\mathbf{w}}_{r}^{*}\in {\mathbb{R}}^{{d}_{r}}$. Note that a bound of ${\parallel {\mathbf{w}}_{r}^{*}\parallel}_{1}\le 1$ ensures that $\leftR(s)\right\le 1$ for all $s$.
Basic definitions. A policy is a map $\pi :\mathcal{S}\times \mathcal{A}\to [0,1]$ such that $\pi (\cdot \mid s)$ is a probability distribution over actions for every state $s$. We denote by $\mathrm{\Pi}$ the set of all such policies. The performance measure for policies we are interested in is the expected discounted reward $R(\pi ):=\mathbb{E}\left({\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}R({s}_{t})\right)$, where the expectation is taken with respect to the distribution over trajectories $\xi =({s}_{0},{s}_{1},{s}_{2},\mathrm{\dots})$ induced by $\pi $ together with the transition probabilities $T$ and the initial state distribution ${P}_{0}$. A policy $\pi $ is optimal for the reward function $R$ if $\pi \in {\mathrm{arg}\mathrm{max}}_{{\pi}^{\prime}\in \mathrm{\Pi}}R({\pi}^{\prime})$, and we denote an optimal policy by ${\pi}^{*}$. Note that $R(\pi )=\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}(\pi )\u27e9$, where ${\mu}_{r}:\mathrm{\Pi}\to {\mathbb{R}}^{{d}_{r}}$, $\pi \mapsto \mathbb{E}\left({\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{\varphi}_{r}({s}_{t})\right)$, is the map taking a policy to its vector of (discounted) feature expectations. We denote by ${\mathrm{\Omega}}_{r}=\{{\mu}_{r}(\pi ):\pi \in \mathrm{\Pi}\}$ the image ${\mu}_{r}(\mathrm{\Pi})$ of this map. Note that the set ${\mathrm{\Omega}}_{r}\in {\mathbb{R}}^{{d}_{r}}$ is convex (see [ziebart2010modeling, Theorem 2.8] and [abbeel2004apprenticeship]), and also bounded due to the discounting factor $\gamma \in (0,1)$. For a finite collection of trajectories $\mathrm{\Xi}={\{{s}_{0}^{i},{s}_{1}^{i},{s}_{2}^{i},\mathrm{\dots}\}}_{i=1,2,\mathrm{\dots}}$ obtained by executing a policy $\pi $ in the MDP $\mathcal{M}$, we denote the empirical counterpart of ${\mu}_{r}(\pi )$ by ${\widehat{\mu}}_{r}(\mathrm{\Xi}):=\frac{1}{\left\mathrm{\Xi}\right}{\sum}_{i}{\sum}_{t}{\gamma}^{t}{\varphi}_{r}({s}_{t}^{i})$.
An IRL learner and a teacher. We consider a learner $\U0001d5ab$ implementing an inverse reinforcement learning (IRL) algorithm and a teacher $\U0001d5b3$. The teacher has access to the full MDP $\mathcal{M}$; the learner knows the MDP and the parametric form of reward function $R(s)=\u27e8{\mathbf{w}}_{r},{\varphi}_{r}(s)\u27e9$ but does not know the true reward parameter ${\mathbf{w}}_{r}^{*}$. The learner, upon receiving demonstrations from the teacher, outputs a policy ${\pi}^{\U0001d5ab}$ using its algorithm. The teacher’s objective is to provide a set of demonstrations ${\mathrm{\Xi}}^{\U0001d5b3}$ to the learner that ensures that the learner’s output policy ${\pi}^{\U0001d5ab}$ achieves high reward $R({\pi}^{\U0001d5ab})$.
The standard IRL algorithms are based on the idea of feature matching [abbeel2004apprenticeship, ziebart2010modeling, osa2018algorithmic]: The learner’s algorithm finds a policy ${\pi}^{\U0001d5ab}$ that matches the feature expectations of the received demonstrations, ensuring that ${\parallel {\mu}_{r}({\pi}^{\U0001d5ab}){\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})\parallel}_{2}\le \u03f5$ where $\u03f5$ specifies a desired level of accuracy. In this standard setting, the learner’s primary goal is to imitate the teacher (via feature matching) and this makes the teaching process easy. In fact, the teacher just needs to provide a sufficiently rich pool of demonstrations ${\mathrm{\Xi}}^{\U0001d5b3}$ obtained by executing ${\pi}^{*}$, ensuring ${\parallel {\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3}){\mu}_{r}({\pi}^{*})\parallel}_{2}\le \u03f5$. This guarantees that ${\parallel {\mu}_{r}({\pi}^{\U0001d5ab}){\mu}_{r}({\pi}^{*})\parallel}_{2}\le 2\u03f5$. Furthermore, the linearity of rewards and ${\parallel {\mathbf{w}}_{r}^{*}\parallel}_{1}\le 1$ ensures that the learner’s output policy ${\pi}^{\U0001d5ab}$ satisfies $R({\pi}^{\U0001d5ab})\ge R({\pi}^{*})2\u03f5$.
Key challenges in teaching a learner with preference constraints. In this paper, we study a novel setting where the learner has its own preferences which it additionally takes into consideration when learning a policy ${\pi}^{\U0001d5ab}$ using teacher’s demonstrations. We formally specify our learner model in the next section; here we highlight the key challenges that arise in teaching such a learner. Given that the learner’s primary goal is no longer just imitating the teacher via feature matching, the learner’s output policy can be suboptimal with respect to the true reward even if it had access to ${\mu}_{r}({\pi}^{*})$, i.e., the feature expectation vector of an optimal policy ${\pi}^{*}$. Figure 1 provides an illustrative example to showcase the suboptimality of teaching when the learner has preferences and constraints. The key challenge that we address in this paper is that of designing a teaching algorithm that selects demonstrations while accounting for the learner’s preferences.
3 Learner Model
In this section we describe the learner models we consider, including different ways of defining preferences and constraints. First, we introduce some notation and definitions that will be helpful. We capture learner’s preferences via a feature map ${\varphi}_{c}:\mathcal{S}\to {[0,1]}^{{d}_{c}}$. We define $\varphi (s)$ as a concatenation of the two feature maps ${\varphi}_{r}(s)$ and ${\varphi}_{c}(s)$ given by ${[{\varphi}_{r}{(s)}^{\u2020},{\varphi}_{c}{(s)}^{\u2020}]}^{\u2020}$ and let $d={d}_{r}+{d}_{c}$. Similar to the map ${\mu}_{r}$, we define ${\mu}_{c}:\mathrm{\Pi}\to {\mathbb{R}}^{{d}_{c}}$, $\pi \mapsto \mathbb{E}\left({\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{\varphi}_{c}({s}_{t})\right)$ and $\mu :\mathrm{\Pi}\to {\mathbb{R}}^{d}$, $\pi \mapsto \mathbb{E}\left({\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}\varphi ({s}_{t})\right)$. Similar to ${\mathrm{\Omega}}_{r}$, we define ${\mathrm{\Omega}}_{c}\subseteq {\mathbb{R}}^{{d}_{c}}$ and $\mathrm{\Omega}\subseteq {\mathbb{R}}^{d}$ as the images of the maps ${\mu}_{c}(\mathrm{\Pi})$ and $\mu (\mathrm{\Pi})$. Note that for any policy $\pi \in \mathrm{\Pi}$, we have $\mu (\pi )={[{\mu}_{r}{(\pi )}^{\u2020},{\mu}_{c}{(\pi )}^{\u2020}]}^{\u2020}$.
Standard (discounted) MCEIRL. Our learner models build on the (discounted) Maximum Causal Entropy (MCE) IRL framework [ziebart2008maximum, ziebart2010modeling, ziebart2013principle, zhou2018mdce]. In the standard (discounted) MCEIRL framework, a learning agent aims to identify a policy that matches the feature expectations of the teacher’s demonstrations while simultaneously maximizing the (discounted) causal entropy given by $H(\pi ):=H({\{{a}_{t}\}}_{t=0,1,\mathrm{\dots}}\parallel {\{{s}_{t}\}}_{t=0,1,\mathrm{\dots}}):={\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}\mathbb{E}\left[\mathrm{log}\pi ({a}_{t}\mid {s}_{t})\right]$. More background is provided in Appendix D.
Including preference constraints. The standard framework can be readily extended to include learner’s preferences in the form of constraints on the preference features ${\varphi}_{c}$. Clearly, the learner’s preferences can render exact matching of the teacher’s demonstrations infeasible and hence we relax this condition. To this end, we consider the following generic learner model:
$\underset{\pi ,{\delta}_{r}^{\text{soft}}\ge 0,{\delta}_{c}^{\text{soft}}\ge 0}{\mathrm{max}}$  $H(\pi ){C}_{r}\cdot {\parallel {\delta}_{r}^{\text{soft}}\parallel}_{p}{C}_{c}\cdot {\parallel {\delta}_{c}^{\text{soft}}\parallel}_{p}$  (1)  
s.t.  ${\mu}_{r}(\pi )[i]{\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})[i]\le {\delta}_{r}^{\text{hard}}[i]+{\delta}_{r}^{\text{soft}}[i]\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
$\mathrm{\hspace{1em}\hspace{1em}}\mathit{\hspace{1em}\hspace{1em}\hspace{1em}}{g}_{j}({\mu}_{c}(\pi ))\le {\delta}_{c}^{\text{hard}}[j]+{\delta}_{c}^{\text{soft}}[j]\forall j\in \{1,2,\mathrm{\dots},m\},$ 
Here, $g:{\mathbb{R}}^{{d}_{c}}\mapsto \mathbb{R}$ are $m$ convex functions representing preference constraints. The coefficients ${C}_{r}$ and ${C}_{c}$ are the learner’s parameters which quantify the relative importance of matching the teacher’s demonstrations and satisfying the learner’s preferences. The learner model is further characterized by parameters ${\delta}_{r}^{\text{hard}}[i]$ and ${\delta}_{c}^{\text{hard}}[j]$ (we will use the vector notation as ${\delta}_{r}^{\text{hard}}\in {\mathbb{R}}_{\ge 0}^{{d}_{r}}$ and ${\delta}_{c}^{\text{hard}}\in {\mathbb{R}}_{\ge 0}^{m}$). The optimization variables for the learner are given by $\pi $, ${\delta}_{r}^{\text{soft}}[i]$, and ${\delta}_{c}^{\text{soft}}[j]$ (we will use the vector notation as ${\delta}_{r}^{\text{soft}}\in {\mathbb{R}}_{\ge 0}^{{d}_{r}}$ and ${\delta}_{c}^{\text{soft}}\in {\mathbb{R}}_{\ge 0}^{m}$). These parameters (${\delta}_{r}^{\text{hard}}$, ${\delta}_{c}^{\text{hard}}$) and optimization variables (${\delta}_{r}^{\text{soft}}$, ${\delta}_{c}^{\text{soft}}$) characterize the following behavior:

•
While a mismatch of up to ${\delta}_{r}^{\text{hard}}$ between the learner’s and teacher’s reward feature expectations incurs no cost regarding the optimization objective, a mismatch larger than ${\delta}_{r}^{\text{hard}}$ incurs a cost of ${C}_{r}\cdot {\parallel {\delta}_{r}^{\text{soft}}\parallel}_{p}$.

•
Similarly, while a violation of up to ${\delta}_{c}^{\text{hard}}$ of the learner’s preference constraints incurs no cost regarding the optimization objective, a violation larger than ${\delta}_{c}^{\text{hard}}$ incurs a cost of ${C}_{c}\cdot {\parallel {\delta}_{c}^{\text{soft}}\parallel}_{p}$.
Next, we discuss two special instances of this generic learner model.
3.1 Learner Model with Hard Preference Constraints
It is instructive to study a special case of the abovementioned generic learner model. Let us consider the model in Eq. 1 with ${\delta}_{r}^{\text{hard}}=0,{\delta}_{c}^{\text{hard}}=0$, and a limiting case with ${C}_{r},{C}_{c}\gg 0$ such that the term $H(\pi )$ can be neglected. Now, if we additionally assume that ${C}_{c}\gg {C}_{r}$, the learner’s objective can be thought of as finding a policy $\pi $ that minimizes the ${L}^{p}$ norm distance to the reward feature expectations of the teacher’s demonstration while satisfying the constraints ${g}_{j}({\mu}_{c}(\pi ))\le 0\forall j\in \{1,2,\mathrm{\dots},m\}$. More formally, we study the following learner model given in Eq. 2 below:
$\underset{\pi}{\mathrm{min}}$  ${\parallel {\mu}_{r}(\pi ){\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})\parallel}_{p}$  (2)  
s.t.  ${g}_{j}({\mu}_{c}(\pi ))\le 0\forall j\in \{1,2,\mathrm{\dots},m\}.$ 
To get a better understanding of the model, we can define the learner’s constraint set as ${\mathrm{\Omega}}^{\U0001d5ab}:=\{\mu :\mu \in \mathrm{\Omega}\text{s.t.}{g}_{j}({\mu}_{c})\le 0\forall j\in \{1,2,\mathrm{\dots},m\}\}$. Similar to ${\mathrm{\Omega}}^{\U0001d5ab}$, we define ${\mathrm{\Omega}}_{r}^{\U0001d5ab}\subseteq {\mathrm{\Omega}}_{r}$ where ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$ is the projection of the set ${\mathrm{\Omega}}^{\U0001d5ab}$ to the subspaces ${\mathbb{R}}^{{d}_{r}}$. We can now rewrite the above optimization problem as ${\mathrm{min}}_{\pi :{\mu}_{r}(\pi )\in {\mathrm{\Omega}}_{r}^{\U0001d5ab}}{\parallel {\mu}_{r}(\pi ){\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})\parallel}_{p}$. Hence, the learner’s behavior is given by:

(i)
Learner can match: When ${\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})\in {\mathrm{\Omega}}_{r}^{\U0001d5ab}$, the learner outputs a policy ${\pi}^{\U0001d5ab}$ s.t. ${\mu}_{r}({\pi}^{\U0001d5ab})={\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})$.

(ii)
Learner cannot match: Otherwise, the learner outputs a policy ${\pi}^{\U0001d5ab}$ such that ${\mu}_{r}({\pi}^{\U0001d5ab})$ is given by the ${L}^{p}$ norm projection of the vector ${\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})$ onto the set ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$.
Figure 1 provides an illustration of the behavior of this learner model. We will design learneraware teaching algorithms for this learner model in Section 4.1 and Section 5.
3.2 Learner Model with Soft Preference Constraints
Another interesting learner model that we study in this paper arises from the generic learner when we consider $m={d}_{c}$ number of boxtype linear constraints with ${g}_{j}({\mu}_{c}(\pi ))={\mu}_{c}(\pi )[j]\forall j\in \{1,2,\mathrm{\dots},{d}_{c}\}$. We consider an ${L}^{1}$ norm penalty on violation, and for simplicity we consider ${\delta}_{r}^{\text{hard}}[i]=0\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$. In this case, the learner’s model is given by
$\underset{\pi ,{\delta}_{r}^{\text{soft}}\ge 0,{\delta}_{c}^{\text{soft}}\ge 0}{\mathrm{max}}$  $H(\pi ){C}_{r}\cdot {\parallel {\delta}_{r}^{\text{soft}}\parallel}_{1}{C}_{c}\cdot {\parallel {\delta}_{c}^{\text{soft}}\parallel}_{1}$  (3)  
s.t.  ${\mu}_{r}(\pi )[i]{\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})[i]\le {\delta}_{r}^{\text{soft}}[i]\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
$\mathrm{\hspace{1em}\hspace{1em}}\mathit{\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\u2006}{\mu}_{c}(\pi )[j]\le {\delta}_{c}^{\text{hard}}[j]+{\delta}_{c}^{\text{soft}}[j]\forall j\in \{1,2,\mathrm{\dots},{d}_{c}\},$ 
The solution to the above problem corresponds to a softmax policy with a reward function ${R}_{\bm{\lambda}}(s)=\u27e8{\bm{w}}_{\bm{\lambda}},\varphi (s)\u27e9$ where ${\bm{w}}_{\bm{\lambda}}\in {\mathbb{R}}^{d}$ is parametrized by $\bm{\lambda}$. The optimal parameters $\bm{\lambda}$ can be computed efficiently and the corresponding softmax policy is then obtained by SoftValueIteration procedure (see [ziebart2010modeling, Algorithm. 9.1], [zhou2018mdce]). Details are provided in Appendix E. We will design learneraware teaching algorithms for this learner model in Section 4.2.
4 Learneraware Teaching under Known Constraints
In this section, we analyze the setting when the teacher has full knowledge of the learner’s constraints.
4.1 A Learneraware Teacher for Hard Preferences: AwareCMDP
Here, we design a learneraware teaching algorithm when considering the learner from Section 3.1. Given that the teacher has full knowledge of the learner’s preferences, it can compute an optimal teaching policy by maximizing the reward over policies that satisfy the learner’s preference constraints, i.e., the teacher solves a constrainedMDP problem (see [de1960problemes, altman1999constrained]) given by
$\underset{\pi}{\mathrm{max}}$  $\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}(\pi )\u27e9\mathit{\hspace{1em}}\text{s.t.}\mathit{\hspace{1em}}{\mu}_{r}(\pi )\in {\mathrm{\Omega}}_{r}^{\U0001d5ab}.$ 
We refer to an optimal solution of this problem as ${\pi}^{\text{aware}}$ and the corresponding teacher as AwareCMDP. We can make the following observation formalizing the value of learneraware teaching:
Theorem 1.
For simplicity, assume that the teacher can provide an exact feature expectation $\mu \mathit{}\mathrm{(}\pi \mathrm{)}$ of a policy instead of providing demonstrations to the learner. Then, the value of learneraware teaching is
$\underset{\pi \text{s.t.}{\mu}_{r}(\pi )\in {\mathrm{\Omega}}_{r}^{\U0001d5ab}}{\mathrm{max}}\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}(\pi )\u27e9\u27e8{\mathbf{w}}_{r}^{*},{\text{Proj}}_{{\mathrm{\Omega}}_{r}^{\U0001d5ab}}\left({\mu}_{r}({\pi}^{*})\right)\u27e9\ge 0.$ 
When the set ${\mathrm{\Omega}}^{\U0001d5ab}$ is defined via a set of linear constraints, the above problem can be formulated as a linear program and solved exactly. Details are provided in Appendix F.
4.2 A Learneraware Teacher for Soft Preferences: AwareBiL
For the learner models in Section 3, the optimal learneraware teaching problem can be naturally formalized as the following bilevel optimization problem:
$\underset{{\pi}^{\U0001d5b3}}{\mathrm{max}}$  $R({\pi}^{\U0001d5ab})\mathit{\hspace{1em}}\text{s.t.}\mathit{\hspace{1em}}{\pi}^{\U0001d5ab}\in \mathrm{arg}\underset{\pi}{\mathrm{max}}\text{IRL}(\pi ,\mu ({\pi}^{\U0001d5b3})),$  (4) 
where $\text{IRL}(\pi ,\mu ({\pi}^{\U0001d5b3}))$ stands for the IRL problem solved by the learner given demonstrations from ${\pi}^{\U0001d5b3}$ and can include preferences of the learner (see Eq. 1 in Section 3).
There are many possibilities for solving this bilevel optimization problem—see for example [sinha2018review] for an overview. In this paper we adopted a singlelevel reduction approach to simplify the above bilevel optimization problem as this results in particularly intuitive optimiziation problems for the teacher. The basic idea of singlelevel reduction is to replace the lowerlevel problem, i.e., $\mathrm{arg}{\mathrm{max}}_{\pi}\text{IRL}(\pi ,\mu ({\pi}^{\U0001d5b3}))$, by the optimality conditions for that problem given by the KarushKuhnTucker conditions [boyd2004convex, sinha2018review]. For the learner model outlined in Section 3.2, these reductions take the following form (see Appendix G for details):
$\underset{\bm{\lambda}:=\{{\bm{\alpha}}^{\text{low}}\in {\mathbb{R}}^{{d}_{r}},{\bm{\alpha}}^{\text{up}}\in {\mathbb{R}}^{{d}_{r}},\bm{\beta}\in {\mathbb{R}}^{{d}_{c}}\}}{\mathrm{max}}$  $\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}({\pi}_{\bm{\lambda}})\u27e9$  (5)  
s.t.  $0\le {\bm{\alpha}}^{\text{low}}\le {C}_{r}$  
$0\le {\bm{\alpha}}^{\text{up}}\le {C}_{r}$  
$\mathrm{\{}0\le \bm{\beta}\le {C}_{c}\mathtt{\text{AND}}{\mu}_{c}({\pi}_{\bm{\lambda}})\le {\delta}_{c}^{\text{hard}}\}$  $\mathtt{\text{OR}}\{\bm{\beta}={C}_{c}\mathtt{\text{AND}}{\mu}_{c}({\pi}_{\bm{\lambda}})\ge {\delta}_{c}^{\text{hard}}\}$ 
where ${\pi}_{\bm{\lambda}}$ corresponds to a softmax policy with a reward function ${R}_{\bm{\lambda}}(s)=\u27e8{\bm{w}}_{\bm{\lambda}},\varphi (s)\u27e9$ for ${\bm{w}}_{\bm{\lambda}}={[{({\bm{\alpha}}^{\text{low}}{\bm{\alpha}}^{\text{up}})}^{\u2020},{\bm{\beta}}^{\u2020}]}^{\u2020}$. Thus, finding optimal demonstrations means optimization over softmax teaching policies while respecting the learner’s preferences. To actually solve the above optimization problem and find good teaching policies, we use an approach inspired by the FrankWolfe algorithm [jaggi2013] detailed in Appendix G. We refer to a teacher implementing this approach as AwareBiL.
5 LearnerAware Teaching Under Unknown Constraints
In this section, we consider the more realistic and challenging setting in which the teacher $\U0001d5b3$ does not know the learner $\U0001d5ab$’s constraint set ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$. Without feedback from $\U0001d5ab$, $\U0001d5b3$ can generally not do better than the agnostic teacher who simply ignores any constraints. We therefore assume that $\U0001d5b3$ and $\U0001d5ab$ interact in rounds as described by Algorithm 5. The two versions of the algorithm we describe in Sections 5.1 and 5.2 are obtained by specifying how $\U0001d5b3$ adapts the teaching policy in each round.
[H] {algorithmic}[1] \StateInitial teaching policy ${\pi}^{\U0001d5b3,0}$ (e.g., optimal policy ignoring any constraints) \Forround $i=0,1,2,\mathrm{\dots}$ \StateTeacher provides demonstrations with feature vector ${\mu}_{r}^{\U0001d5b3,i}$ using policy ${\pi}^{\U0001d5b3,i}$ \StateLearner upon receiving ${\mu}_{r}^{\U0001d5b3,i}$ computes a policy ${\pi}^{\U0001d5ab,i}$ with feature vector ${\mu}_{r}^{\U0001d5ab,i}$ \StateTeacher observes learner’s feature vector ${\mu}_{r}^{\U0001d5ab,i}$ and adapts the teaching policy \EndFor
In this section, we assume that $\U0001d5ab$ is as described in Section 3.1: Given demonstrations ${\mathrm{\Xi}}^{\U0001d5b3}$, $\U0001d5ab$ finds a policy ${\pi}^{\U0001d5ab}$ such that ${\mu}_{r}({\pi}^{\U0001d5ab})$ matches the ${L}^{2}$projection of ${\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})$ onto ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$. For the sake of simplifying the presentation and the analysis, we also assume that $\U0001d5ab$ and $\U0001d5b3$ can observe the exact feature expectations of their respective policies, e.g., ${\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})={\mu}_{r}({\pi}^{\U0001d5b3})$ if ${\mathrm{\Xi}}^{\U0001d5b3}$ is sampled from ${\pi}^{\U0001d5b3}$.
5.1 An Adaptive Learneraware Teacher Using Volume Search: AdAwareVol
In our first adaptive teaching algorithm AdAwareVol, $\U0001d5b3$ maintains an estimate ${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab}\supset {\mathrm{\Omega}}_{r}^{\U0001d5ab}$ of the learner’s constraint set, which in each round gets updated by intersecting the current version with a certain affine halfspace, thus reducing the volume of ${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab}$. The new teaching policy is then any policy ${\pi}^{\U0001d5b3,i+1}$ which is optimal under the constraint that ${\mu}^{\U0001d5b3,i+1}\in {\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab}$. The interaction ends as soon as ${\parallel {\mu}_{r}^{\U0001d5ab,i}{\mu}_{r}^{\U0001d5b3,i}\parallel}_{2}\le \u03f5$ for a threshold $\u03f5$. Details are provided in Appendix C.1.
Theorem 2.
Upon termination of AdAwareVol, $\mathrm{L}$’s output policy ${\pi}^{\mathrm{L}}$ satisfies $R\mathit{}\mathrm{(}{\pi}^{\mathrm{L}}\mathrm{)}\mathrm{\ge}R\mathit{}\mathrm{(}{\pi}^{\mathrm{\text{aware}}}\mathrm{)}\mathrm{}\u03f5$ for any policy ${\pi}^{\mathrm{\text{aware}}}$ which is optimal under $\mathrm{L}$’s constraints. For the special case that ${\mathrm{\Omega}}_{r}^{\mathrm{L}}$ is a polytope defined by $m$ linear inequalities, the algorithm terminates in $O\mathit{}\mathrm{(}{m}^{{d}_{r}}\mathrm{)}$ iterations.
5.2 An Adaptive Learneraware Teacher Using Line Search: AdAwareLin
In our second adaptive teaching algorithm, AdAwareLin, $\U0001d5b3$ adapts the teaching policy by performing a binary search on a line segment of the form $\{{\mu}_{r}^{\U0001d5ab,i}+\alpha {\mathbf{w}}_{r}^{*}\alpha \in [{\alpha}_{\mathrm{min}},{\alpha}_{\mathrm{max}}]\}\subset {\mathbb{R}}^{{d}_{r}}$ to find a vector ${\mu}_{r}^{\U0001d5b3,i+1}={\mu}_{r}^{\U0001d5ab,i}+{\alpha}_{i}{\mathbf{w}}_{r}^{*}$ that is the vector of feature expectations of a policy; here ${\alpha}_{\mathrm{max}}>{\alpha}_{\mathrm{min}}>0$ are fixed constants. If that is not successful, the teacher finds a teaching policy with ${\mu}_{r}^{\U0001d5b3,i+1}\in {\mathrm{arg}\mathrm{min}}_{{\mu}_{r}\in {\mathrm{\Omega}}_{r}}{\parallel {\mu}_{r}{\mu}_{r}^{\U0001d5ab,i}{\alpha}_{\mathrm{min}}{\mathbf{w}}_{r}^{*}\parallel}_{2}$. The following theorem analyzes the convergence of $\U0001d5ab$’s performance to ${\overline{R}}_{\U0001d5ab}:={\mathrm{max}}_{{\mu}_{r}\in {\mathrm{\Omega}}_{r}}R({\mu}_{r})$ under the assumption that $\U0001d5b3$’s search succeeds in every round. The proof and further details are provided in Appendix C.2.
Theorem 3.
Fix some $\epsilon \mathrm{>}\mathrm{0}$ and assume that there exists a constant ${\alpha}_{\mathrm{min}}\mathrm{>}\mathrm{0}$ such that, as long as ${\overline{R}}_{\mathrm{L}}\mathrm{}R\mathit{}\mathrm{(}{\mu}_{r}^{\mathrm{L}\mathrm{,}i}\mathrm{)}\mathrm{>}\epsilon $, the teacher can find a teaching policy ${\pi}^{\mathrm{T}\mathrm{,}i\mathrm{+}\mathrm{1}}$ satisfying ${\mu}_{r}^{\mathrm{T}\mathrm{,}i\mathrm{+}\mathrm{1}}\mathrm{=}{\mu}_{r}^{\mathrm{L}\mathrm{,}i}\mathrm{+}{\alpha}_{i}\mathit{}{\mathrm{w}}_{r}^{\mathrm{*}}$ for some ${\alpha}_{i}\mathrm{\ge}{\alpha}_{\mathrm{min}}$. Then the learner’s performance increases monotonically in each round of AdAwareLin, i.e., $R\mathit{}\mathrm{(}{\mu}_{r}^{\mathrm{L}\mathrm{,}i\mathrm{+}\mathrm{1}}\mathrm{)}\mathrm{>}R\mathit{}\mathrm{(}{\mu}_{r}^{\mathrm{L}\mathrm{,}i}\mathrm{)}$. Moreover, after at most $O\mathit{}\mathrm{(}\frac{{D}^{\mathrm{2}}}{\epsilon \mathit{}{\alpha}_{\mathrm{min}}}\mathit{}\mathrm{log}\mathit{}\frac{D}{\epsilon}\mathrm{)}$ teaching steps, the learner’s performance satisfies $R\mathit{}\mathrm{(}{\mu}_{r}^{\mathrm{L}\mathrm{,}i}\mathrm{)}\mathrm{>}{\overline{R}}_{\mathrm{L}}\mathrm{}\mathrm{2}\mathit{}\epsilon $. Here we abbreviate $D\mathrm{:=}\mathrm{diam}\mathit{}{\mathrm{\Omega}}_{r}$.
6 Experimental Evaluation
In this section we evaluate our teaching algorithms for different types of learners on the environment introduced in Figure 1. The environment we consider here has three types of reward objects, i.e., a “star" object with reward of $1.0$, a “plus" object with reward of $0.9$, and a “dot" object with reward of $0.2$. Two objects of each type are placed randomly on the grid such that there is always only a single object in each grid cell. The presence of an object of type “star”, “plus”, or “dot” in some state $s$ is encoded in the reward features ${\varphi}_{r}(s)$ by a binaryindicator for each type such that ${d}_{r}=3$. We use a discount factor of $\gamma =0.99$. Upon collecting an object, there is a $0.1$ probability of transiting to a terminal state.
Learner models. We consider a total of 5 different learners whose preferences can be described by distractors in the environment. Each learner prefers to avoid a certain subset of these distractors. There is a total of 4 of distractors: (i) two “green" distractors are randomly placed at a distance of 0cell and 1cell to the “star" objects, respectively; (ii) two “yellow" distractors are randomly placed at a distance of 1cell and 2cells to the “plus" objects, respectively, see Figure 1(a).
Through these distractors we define learners L1L5 as follows: (L1) no preference features (${d}_{c}=0$); (L2) two preference features (${d}_{c}=2$) such that ${\varphi}_{c}(s)[1]$ and ${\varphi}_{c}(s)[2]$ are binary indicators of whether there is a “green" distractor at a distance of 0cells or 1cell, respectively; (L3) four preference features (${d}_{c}=4$) such that ${\varphi}_{c}(s)[1],{\varphi}_{c}(s)[2]$ are as for L2, and ${\varphi}_{c}(s)[3]$ and ${\varphi}_{c}(s)[4]$ are binary indicators of whether there is a “green" distractor at a distance of 2cells or a “yellow” distractor at a distance of 0cells, respectively; (L4) five preference features (${d}_{c}=5$) such that ${\varphi}_{c}(s)[1],\mathrm{\dots},{\varphi}_{c}(s)[4]$ are as for L3, and ${\varphi}_{c}(s)[5]$ is a binary indicator whether there is a “yellow” distractor at a distance of 1cell; and (L5) six preference features (${d}_{c}=6$) such that ${\varphi}_{c}(s)[1],\mathrm{\dots},{\varphi}_{c}(s)[5]$ are as for L4, and ${\varphi}_{c}(s)[6]$ is a binary indicator whether there is a “yellow” distractor at a distance of 2cells.
The first row in Figure 2 shows an instance of the considered objectworlds and indicates the preference of the learners to avoid certain regions by the gray area.



6.1 Teaching under known constraints
In this section we consider learners with soft constraints from Section 3.2, with preference features as described above, and parameters ${C}_{r}=5$, ${C}_{c}=10$, and ${\delta}_{c}^{\text{hard}}=0$ (more experimental results for different values of ${C}_{r}$ and ${C}_{c}$ are provided in Appendix B.1). Our first results are presented in Figure 2. The second and third rows show the rewards inferred by the learners for demonstrations provided by a learneragnostic teacher who ignores any constraints (Agnostic) and the bilevel learneraware teacher (AwareBiL), respectively. We observe that Agnostic fails to teach the learner about objects’ positive rewards in cases where the learners’ preferences conflict with the position of the most rewarding objects (second row). In contrast, AwareBiL always successfully teaches the learners about rewarding objects that are compatible with the learners’ preferences (third row).
We also compare Agnostic and AwareBiL in terms of reward achieved by the learner after teaching for object worlds of size $10\times 10$ in Table 1. The numbers show the average reward over 10 randomly generated objectworlds. Note that AwareBiL has to solve a nonconvex optimization problem to find the optimal teaching policy, cf. Eq. 5. Because we use a gradientbased optimization approach, the teaching policies found can depend on the initial point for optimization. Hence, we always consider the following two initial points for optimization and select the teaching policy which results in a higher objective value: (i) all optimization variables in Eq. 5 are set to zero, and (ii) the optimization variables are initialized as ${\alpha}^{\text{low}}[i]=\mathrm{max}\{{w}_{\bm{\lambda}}[i],0\}$, ${\alpha}^{\text{up}}[i]=\mathrm{max}\{{w}_{\bm{\lambda}}[i],0\}$, and $\bm{\beta}=0$, where ${\bm{w}}_{\bm{\lambda}}$ is as inferred by the learner when taught by Agnostic and $i\in \{1,\mathrm{\dots},{d}_{r}\}$, cf. Section 3.2. From Table 1 we observe that a learner can learn better policies from a teacher that accounts for the learner’s preferences.
Learner (${C}_{r}\mathrm{=}\mathrm{5}\mathrm{,}{C}_{c}\mathrm{=}\mathrm{10}$)  

L1  L2  L3  L4  L5  
Teacher  Agnostic  $7.99\pm 0.02$  $0.01\pm 0.00$  $0.01\pm 0.00$  $0.01\pm 0.00$  $0.00\pm 0.00$ 
AwareBiL  $8.00\pm 0.02$  $7.20\pm 0.01$  $4.86\pm 0.30$  $3.15\pm 0.27$  $1.30\pm 0.07$ 
6.2 Teaching under unknown constraints
In this section we evaluate the teaching algorithms from Section 5. We consider the learner model from Section 3.1 that uses ${L}^{2}$projection to match reward feature expectations as studied in Section 5, cf. Eq. 2.^{1}^{1} 1 To implement the learner in Eq. 2, we approximated the learner’s projection onto the set ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$ as follows: We implemented the learner based on the optimization problem given in Eq. 3 with a hard constraint on preferences and ${L}^{2}$ norm penalty on reward mismatch scaled with a large value of ${C}_{r}=20$. For modeling the hard constraints, we consider boxtype linear constraints with ${\delta}_{c}^{\text{hard}}[j]=2.5\forall j\in \{1,2,\mathrm{\dots},{d}_{c}\}$ for the preference features, cf. Eq. 3.
We study the learners L1, L2, and L3 with preferences corresponding to the first three objectworlds shown in Figure 1(a). We report the results for learner L2 below; results for learners L1 and L3 are deferred to the Appendix B.2.
In this context it is instructive to investigate how quickly these adaptive teaching strategies converge to the performance of a teacher who has full knowledge about the learner. Results comparing the adaptive teaching strategies (AdAwareVol and AdAwareLin) are shown in Figure 2(a). We can observe that both teaching strategies get close to the best possible performance under full knowledge about the learner (AwareCMDP). We also provide results showing the performance achieved by the adaptive teaching strategies on objectworlds of varying sizes, see Figure 2(b).
Note that the performance of AdAwareVol decreases slightly when teaching for more rounds, i.e., comparing the results after 3 teaching rounds and at the end of the teaching process. This is because of approximations when learner is computing the policy via projection, which in turn leads to errors on the teacher side when approximating ${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab}$ (refer to discussion in Footnote 1). In contrast, AdAwareLin performance always increases when teaching for more rounds.

7 Related Work
Our work is closely related to algorithmic machine teaching [goldman1995complexity, zhu2015machine, zhu_overview_2018], whose general goal is to design teaching algorithms that optimize the data that is provided to a learning algorithm. Most works in machine teaching so far focus on supervised learning tasks and assume that the learning algorithm is fully known to the teacher, see e.g., [zhu2013machine, singla2014near, liu2016teaching, mac_aodha_teaching_2018].
In the IRL setting, few works study how to provide maximally informative demonstrations to the learner, e.g., [cakmak2012algorithmic, danielbrown2018irl]. In contrast to our work, their teacher fully knows the learner model and provides the demonstrations without any adaptation to the learner. The question of how a teacher should adaptively react to a learner has been addressed by [singla2013actively, liu2017blackbox, chen2018understanding, melo2018interactive, DBLP:conf/aaai/YeoKSMAFDC19, hunziker2018teaching], but only in the supervised setting. In a recent work, [DBLP:conf/ijcai/KamalarubanDCS19] have studied the problem of adaptively teaching an IRL agent by providing an informative sequence of demonstrations. However, they assume that the teacher has full knowlege of the learner’s dynamics.
Within the area of IRL, there is a line of work on active learning approaches [cohn_comparing_2011, brown_riskaware_2018, brown2018efficient, kareem2018_repeated, cui_active_2018], which is related to our work. In contrast to us, they take the perspective of the learner who actively influences the demonstrations it receives. A few papers have addressed the problem that arises when the learner does not have full access to the reward features, e.g., [levine2010feature] and [haug_teaching_2018].
Our work is also loosely related to multiagent reinforcement learning. [dimitrakakis2017multi] studied the interaction between agents with misaligned models with a focus on the question of how to jointly optimize a policy. [ghosh19towardsrobust] studied the problem of designing robust AI agent that can interact with another agent of unknown type. However, these works do not tackle the problem of teaching an agent by demonstrations. Another related work is [hadfield2016cooperative] which studied the cooperation of agents who do not perfectly understand each other.
8 Conclusions and Outlook
In this paper we considered inverse reinforcement learning in the context of learners with preferences and constraints. In this setting, the learner does not only focus on matching the teacher’s demonstrated behavior but also takes its own preferences, e.g., behavioral biases or physical constraints, into account. We developed a theoretical framework for this setting, and proposed and studied algorithms for learneraware teaching in which the teacher accounts for the learner’s preferences for the cases of known and unknown preference constraints. We demonstrated significant performance improvements of our learneraware teaching strategies as compared to learneragnostic teaching both theoretically and empirically. Our theoretical framework and our proposed algorithms foster the application of IRL in realworld settings in which the learner does not blindly follow a teacher’s demonstrations.
There are several promising directions for future work, including but not limited to: The evaluation of our approach in machinehuman and humanmachine tasks; extensions of our approach to other learner models; approaches for learning efficiently from a learner’s point of view from a fixed set of (potentially suboptimal) demonstrations in the case of preference constraints.
Acknowledgements
This work was supported by Microsoft Research through its PhD Scholarship Programme.
References
Appendix A List of Appendices
In this section we provide a brief description of the content provided in the appendices of the paper.
 •
 •
 •
 •
 •
 •
Appendix B Experimental Evaluation: Additional Results (Section 6)
B.1 Teaching under known constraints (Section 6.1)
Additional results for teaching under known constraints are presented in Table 2. We observe that AwareBiL clearly outperforms Agnostic for most combinations of ${C}_{r}$ and ${C}_{c}$. Only for ${C}_{r}=10,{C}_{c}=1$, the teachers AwareBiL and Agnostic achieve similar performance because ${C}_{r}\gg {C}_{c}$, and hence the learner values achieving higher reward more than satisfying its preferences.
Learner (${C}_{r}\mathrm{=}\mathrm{5}\mathrm{,}{C}_{c}\mathrm{=}\mathrm{10}$)  

L1  L2  L3  L4  L5  
Teacher  Agnostic  $7.99\pm 0.02$  $0.01\pm 0.00$  $0.01\pm 0.00$  $0.01\pm 0.00$  $0.00\pm 0.00$ 
AwareBiL  $8.00\pm 0.02$  $7.20\pm 0.01$  $4.86\pm 0.30$  $3.15\pm 0.27$  $1.30\pm 0.07$ 
Learner (${C}_{r}\mathrm{=}\mathrm{10}\mathrm{,}{C}_{c}\mathrm{=}\mathrm{10}$)  

L1  L2  L3  L4  L5  
Teacher  Agnostic  $8.34\pm 0.01$  $0.17\pm 0.02$  $0.01\pm 0.00$  $0.01\pm 0.00$  $0.00\pm 0.00$ 
AwareBiL  $8.33\pm 0.01$  $6.90\pm 0.17$  $5.03\pm 0.31$  $3.27\pm 0.28$  $1.35\pm 0.07$ 
Learner (${C}_{r}\mathrm{=}\mathrm{10}\mathrm{,}{C}_{c}\mathrm{=}\mathrm{5}$)  

L1  L2  L3  L4  L5  
Teacher  Agnostic  $8.36\pm 0.01$  $8.14\pm 0.03$  $0.01\pm 0.00$  $0.01\pm 0.00$  $0.00\pm 0.00$ 
AwareBiL  $8.34\pm 0.01$  $8.13\pm 0.03$  $5.20\pm 0.29$  $3.43\pm 0.27$  $1.69\pm 0.0$ 
Learner (${C}_{r}=5,{C}_{c}=5$)  

L1  L2  L3  L4  L5  
Teacher  Agnostic  $7.99\pm 0.02$  $0.17\pm 0.02$  $0.01\pm 0.00$  $0.01\pm 0.00$  $0.00\pm 0.00$ 
AwareBiL  $8.00\pm 0.02$  $6.64\pm 0.17$  $4.87\pm 0.30$  $3.16\pm 0.27$  $1.31\pm 0.06$ 
Learner (${C}_{r}\mathrm{=}\mathrm{10}\mathrm{,}{C}_{c}\mathrm{=}\mathrm{1}$)  

L1  L2  L3  L4  L5  
Teacher  Agnostic  $8.36\pm 0.01$  $8.39\pm 0.02$  $8.46\pm 0.02$  $8.46\pm 0.02$  $8.49\pm 0.02$ 
AwareBiL  $8.33\pm 0.01$  $8.36\pm 0.03$  $8.44\pm 0.02$  $8.44\pm 0.02$  $8.46\pm 0.02$ 
Learner (${C}_{r}\mathrm{=}\mathrm{1}\mathrm{,}{C}_{c}\mathrm{=}\mathrm{10}$)  

L1  L2  L3  L4  L5  
Teacher  Agnostic  $5.67\pm 0.02$  $0.15\pm 0.02$  $0.16\pm 0.02$  $0.11\pm 0.01$  $0.08\pm 0.01$ 
AwareBiL  $5.93\pm 0.02$  $4.49\pm 0.15$  $3.56\pm 0.24$  $2.30\pm 0.22$  $0.93\pm 0.05$ 
B.2 Teaching under unknown constraints (Section 6.2)
Here, we provide additional experimental results for teaching algorithms from Section 5. In particular, we report on the results for learner L1 and learner L3, similar to the results for learner L2 reported in Section 6.2.


Appendix C Details for LearnerAware Teaching under Unknown Constraints (Section 5)
In this appendix, we provide more details on the adaptive teaching algorithms AdAwareVol and AdAwareLin described in Sections 5.1 and 5.2. Recall that both teaching algorithms are obtained from Algorithm 5 by defining the way in which the teacher $\U0001d5b3$ adapts the teaching policy based on the learner $\U0001d5ab$’s feature expectations ${\mu}_{r}^{\U0001d5ab}$ in past rounds.
C.1 Details for AdAwareVol (Section 5.1)
Estimation of the learner’s constraint set.
In AdAwareVol, $\U0001d5b3$ maintains an estimate ${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab,i}$ of $\U0001d5ab$’s constraint set, starting with ${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab,0}={\mathrm{\Omega}}_{r}$. After observing the feature expectations ${\mu}_{r}^{\U0001d5ab,i}$ of the policy $\U0001d5ab$ found in round $i$, $\U0001d5b3$ updates this estimate as follows:
$${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab,i+1}:={\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab,i}\cap \{{\mu}_{r}^{\U0001d5ab,i}+\nu \in {\mathbb{R}}^{{d}_{r}}\u27e8{\mu}_{r}^{\U0001d5b3,i}{\mu}_{r}^{\U0001d5ab,i},\nu \u27e9\le 0\}$$  (6) 
The set on the right hand side of (6) with which ${\mathrm{\Omega}}_{r}^{\U0001d5ab,i}$ gets intersected is a halfspace containing ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$. This is due to the fact that ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$ is convex by assumption, and to our assumption that $\U0001d5ab$’s learning algorithm is such that it outputs a policy whose feature expectations ${\mu}_{r}^{\U0001d5ab,i}$ match the ${L}^{2}$projection of ${\mu}_{r}^{\U0001d5b3,i}$ to ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$. Inductively, it follows that ${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab,i}\supset {\mathrm{\Omega}}_{r}^{\U0001d5ab}$ for all $i$.
In practice, we implement a slightly modified version of the update step in which we intersect ${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab,i}$ with a halfspace that is shifted in the direction of ${\mu}_{r}^{\U0001d5b3,i}{\mu}_{r}^{\U0001d5ab,i}$ by a small amount, i.e., we use
$$\{{\mu}_{r}^{\U0001d5ab,i}+(1\eta )({\mu}_{r}^{\U0001d5b3,i}{\mu}_{r}^{\U0001d5ab,i})+\nu \in {\mathbb{R}}^{{d}_{r}}\u27e8{\mu}_{r}^{\U0001d5b3,i}{\mu}_{r}^{\U0001d5ab,i},\nu \u27e9\le 0\}$$ 
with a step size parameter $\eta \in (0,1)$. This helps make the algorithm more robust to noise in the learner’s feature expectations. In our experiments, we used $\eta =0.9$.
Update of the teaching policy.
After updating the estimate of the learner’s constraint set to ${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab,i}$, $\U0001d5b3$ solves a constrained MDP in order to find
$${\pi}^{\U0001d5b3,i+1}\in \underset{\pi ,{\mu}_{r}(\pi )\in {\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab,i}}{\mathrm{arg}\mathrm{max}}R(\pi ).$$ 
Given that ${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab,i}$ is cut out by linear equations, solving the constrained MDP reduces to solving an LP, as described in Appendix F.
Termination of the interaction.
The algorithm terminates as soon as the stopping criterion ${\parallel {\mu}_{r}^{\U0001d5ab,i}{\mu}_{r}^{\U0001d5b3,i}\parallel}_{2}\le \u03f5$ is satisfied. Note that ${\widehat{\mathrm{\Omega}}}_{r}^{\U0001d5ab,i}\supset {\mathrm{\Omega}}_{r}^{\U0001d5ab}$ implies that
$$R({\pi}^{\U0001d5b3,i})\ge R({\pi}^{\text{aware}})$$ 
for any ${\pi}^{\text{aware}}\in {\mathrm{arg}\mathrm{max}}_{\pi ,{\mu}_{r}(\pi )\in {\mathrm{\Omega}}_{r}^{\U0001d5ab}}R(\pi )$. Therefore, after termination we have
$$R({\pi}^{\U0001d5ab,i})\ge R({\pi}^{\text{aware}})\u03f5$$ 
for any policy ${\pi}^{\text{aware}}$ which is optimal under $\U0001d5ab$’s constraints, which is the first statement of Theorem 2.
The second statement of Theorem 2 follows from the fact that if ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$ is a convex polytope cut out by $m$ linear inequalities, the number of faces, which is in $O({m}^{{d}_{r}})$, is an upper bound on the number of iterations of the algorithm, because one face is “eliminated” in each round.
C.2 Details for AdAwareLin (Section 5.2)
In AdAwareLin, $\U0001d5b3$ updates the teaching policy ${\pi}^{\U0001d5b3,i+1}$ based on $\U0001d5ab$’s feature expectations ${\mu}_{r}^{\U0001d5ab,i}$ from the previous round. To do so, $\U0001d5b3$ uses LineSearch (Algorithm C.2) to perform a binary search on the line segment
$$\{{\mu}_{r}^{\U0001d5ab,i}+\alpha {\mathbf{w}}_{r}^{*}\alpha \in [{\alpha}_{\mathrm{min}},{\alpha}_{\mathrm{max}}]\}\subset {\mathbb{R}}^{{d}_{r}}$$  (7) 
in order to find a vector ${\mu}_{r}$ that is realizable as the vector of feature expectations of a policy. If the intersection of the line segment (7) with ${\mathrm{\Omega}}_{r}$ is nonempty, it is of the form $\{{\mu}_{r}^{\U0001d5ab}+\alpha {\mathbf{w}}_{r}^{*}\alpha \in [{\alpha}_{\mathrm{min}},{\alpha}^{*}]\}$ for some ${\alpha}^{*}\le {\alpha}_{\mathrm{max}}$ due to the convexity of ${\mathrm{\Omega}}_{r}$. In that case, LineSearch returns a policy with feature expectations
$${\mu}_{r}^{\U0001d5b3,i+1}={\mu}_{r}^{\U0001d5ab,i}+{\alpha}_{i}^{*}{\mathbf{w}}_{r}^{*},$$ 
where ${\alpha}_{i}^{*}$ is the maximal $\alpha \in [{\alpha}_{\mathrm{min}},{\alpha}_{\mathrm{max}}]$ such that ${\mu}_{r}^{\U0001d5ab,i}+\alpha {\mathbf{w}}_{r}^{*}\in {\mathrm{\Omega}}_{r}$. If the intersection is empty, LineSearch returns a policy with feature expectations
$${\mu}_{r}^{\U0001d5b3,i+1}\in \underset{{\mu}_{r}\in {\mathrm{\Omega}}_{r}}{\mathrm{arg}\mathrm{min}}{\parallel {\mu}_{r}{\mu}_{r}^{\U0001d5ab,i}{\alpha}_{\mathrm{min}}{\mathbf{w}}_{r}^{*}\parallel}_{2}.$$ 
Figure 7 illustrates the two cases that may occur.
C.2.1 Proof of Theorem 3
In this section, we provide the proof of Theorem 3, which gives a guarantee on the improvement of $\U0001d5ab$’s performance in each round of the AdAwareLin algorithm. The assumption we make here is that, in every teaching round, LineSearch returns a teaching policy ${\pi}^{\U0001d5b3,i+1}$ such that ${\mu}_{r}^{\U0001d5b3,i+1}={\mu}_{r}^{\U0001d5ab,i}+{\alpha}_{i}{\mathbf{w}}_{r}^{*}$ for some ${\alpha}_{i}\ge {\alpha}_{\mathrm{min}}$, where ${\alpha}_{\mathrm{min}}>0$ is a fixed constant. It is easy to see that this assumption, together with our assumption on $\U0001d5ab$’s algorithm and the convexity of ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$, imply that the change in learner performance
$$\mathrm{\Delta}{R}_{i}:=R({\mu}_{r}^{\U0001d5ab,i+1})R({\mu}_{r}^{\U0001d5ab,i})$$ 
is nonnegative in every teaching round. The following proposition, which will be needed in the proof of Theorem 3, strengthens this statement:
Proposition 1.
Let ${\overline{R}}_{\mathrm{L}}\mathrm{:=}{\mathrm{max}}_{{\mu}_{r}\mathrm{\in}{\mathrm{\Omega}}_{r}^{\mathrm{L}}}\mathit{}R\mathit{}\mathrm{(}{\mu}_{r}\mathrm{)}$ be the maximally achievable learner performance. Assume that, in teaching round $i$, $\mathrm{T}$ can find a teaching policy ${\pi}^{\mathrm{T}\mathrm{,}i\mathrm{+}\mathrm{1}}$ whose feature expectations satisfy ${\mu}_{r}^{\mathrm{T}\mathrm{,}i\mathrm{+}\mathrm{1}}\mathrm{=}{\mu}_{r}^{\mathrm{L}\mathrm{,}i}\mathrm{+}{\alpha}_{i}\mathit{}{\mathrm{w}}_{r}^{\mathrm{*}}$ for some ${\alpha}_{i}\mathrm{>}\mathrm{0}$. Then
$${\overline{R}}_{\U0001d5ab}R({\mu}_{r}^{\U0001d5ab,i})\le \mathrm{\Delta}{R}_{i}+D\cdot \sqrt{\frac{\mathrm{\Delta}{R}_{i}}{{\alpha}_{i}\mathrm{\Delta}{R}_{i}}},$$  (8) 
where $D\mathrm{=}\mathrm{diam}\mathit{}{\mathrm{\Omega}}_{r}$.
Proof of Proposition 1.
Consider the plane $V\subset {\mathbb{R}}^{{d}_{r}}$ spanned by ${\mu}_{r}^{\U0001d5ab,i},{\mu}_{r}^{\U0001d5b3,i+1}$ and ${\mu}_{r}^{\U0001d5ab,i+1}$ and denote by ${\stackrel{~}{\mu}}_{r}$ the unique point in $V$ with the properties that

(a)
$\u27e8{\mathbf{w}}_{r}^{*},{\stackrel{~}{\mu}}_{r}\u27e9=\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}^{\U0001d5ab,i+1}\u27e9$,

(b)
${\stackrel{~}{\mu}}_{r}$ lies on the same side of the line through ${\mu}^{\U0001d5ab,i}$ and ${\mu}^{\U0001d5b3,i+1}$ as ${\mu}_{r}^{\U0001d5ab,i+1}$, and

(c)
${\stackrel{~}{\mu}}_{r},{\mu}_{r}^{\U0001d5b3,i+1}$ and ${\mu}_{r}^{\U0001d5ab,i}$ span a right triangle with ${\stackrel{~}{\mu}}_{r}$ at the rightangled corner.
Note that ${\mu}_{r}^{\U0001d5ab,i+1}$ must lie inside this triangle, i.e., on the red line segment in Figure 8: Otherwise there would a point on the line segment connecting ${\mu}_{r}^{\U0001d5ab,i+1}$ and ${\mu}_{r}^{\U0001d5ab,i}$, and hence in ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$ by convexity, which is closer to ${\mu}_{r}^{\U0001d5b3,i+1}$ than ${\mu}_{r}^{\U0001d5ab,i+1}$, contradicting the fact that ${\mu}_{r}^{\U0001d5ab,i+1}$ is closest to ${\mu}_{r}^{\U0001d5b3,i+1}$ among all points in ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$. Denote by $\stackrel{~}{\mathrm{\ell}}$ the line passing through ${\stackrel{~}{\mu}}_{r}$ and ${\mu}_{r}^{\U0001d5ab,i}$.
The facts that ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$ is convex and that ${\mu}_{r}^{\U0001d5ab,i+1}={\mathrm{arg}\mathrm{min}}_{{\mu}_{r}\in {\mathrm{\Omega}}_{r}^{\U0001d5ab}}{\parallel {\mu}_{r}^{\U0001d5b3,i+1}{\mu}_{r}\parallel}_{2}$ imply that ${\mathrm{\Omega}}_{r}^{\U0001d5ab}$ must lie on one side of the hyperplane
$${\mu}_{r}^{\U0001d5ab,i+1}+{({\mu}_{r}^{\U0001d5b3,i+1}{\mu}_{r}^{\U0001d5ab,i+1})}^{\u27c2}\subset {\mathbb{R}}^{{d}_{r}}.$$ 
Therefore, we can upper bound ${\overline{R}}_{\U0001d5ab}$ in terms of the slope ${s}_{\mathrm{\ell}}$ of the line $\mathrm{\ell}$ which arises by intersecting that hyperplane with $V$:
$${\overline{R}}_{\U0001d5ab}\le R({\mu}_{r}^{\U0001d5ab,i+1})+D\cdot {s}_{\mathrm{\ell}}=R({\mu}_{r}^{\U0001d5ab,i})+\mathrm{\Delta}{R}_{i}+D\cdot {s}_{\mathrm{\ell}}.$$  (9) 
Note that the slope ${s}_{\mathrm{\ell}}$ is upper bounded by the slope ${s}_{\stackrel{~}{\mathrm{\ell}}}$ of $\stackrel{~}{\mathrm{\ell}}$. We have ${s}_{\stackrel{~}{\mathrm{\ell}}}=\frac{\mathrm{\Delta}{R}_{i}}{h}$, where $h$ is the length of the red line segment in Figure 8, and $h=\sqrt{({\alpha}_{i}\mathrm{\Delta}{R}_{i})\mathrm{\Delta}{R}_{i}}$ by Pythagoras’s theorem. Using that, we obtain
$${s}_{\mathrm{\ell}}\le {s}_{\stackrel{~}{\mathrm{\ell}}}=\sqrt{\frac{\mathrm{\Delta}{R}_{i}}{{\alpha}_{i}\mathrm{\Delta}{R}_{i}}}.$$  (10) 
The claimed estimate (8) follows by plugging this upper bound for $s$ into (9) and rearranging. ∎
Proof of Theorem 3.
Proof of Theorem 3.
The fact that $R({\mu}_{r}^{\U0001d5ab,i+1})>R({\mu}_{r}^{\U0001d5ab,i})$, which is equivalent to $\mathrm{\Delta}{R}_{i}>0$, follows immediately from Proposition 1.
We now prove the claimed rate of convergence.
First, using Proposition 1, we note that the assumption that ${\overline{R}}_{\U0001d5ab}R({\mu}_{r}^{\U0001d5ab,i})>\epsilon $ implies that
$$  (11) 
Using that, we can conclude that
$$\sqrt{\mathrm{\Delta}{R}_{i}}>\mathrm{min}\{\sqrt{\epsilon /2},\epsilon \sqrt{{\alpha}_{\mathrm{min}}/(4{D}^{2}+{\epsilon}^{2})}\}.$$  (12) 
Indeed, if $\mathrm{\Delta}{R}_{i}\le \frac{\epsilon}{2}$, it follows from (11) that we must have $D\cdot \sqrt{\mathrm{\Delta}{R}_{i}/({\alpha}_{\mathrm{min}}\mathrm{\Delta}{R}_{i})}>\frac{\epsilon}{2}$, which implies $\sqrt{\mathrm{\Delta}{R}_{i}}>\epsilon \sqrt{{\alpha}_{\mathrm{min}}/(4{D}^{2}+{\epsilon}^{2})}$. Since we are interested in the behavior as $\epsilon \to 0$, we assume from now on that $\epsilon $ is so small that $$, so that (12) becomes
$$\sqrt{\mathrm{\Delta}{R}_{i}}>\epsilon \sqrt{{\alpha}_{\mathrm{min}}/(4{D}^{2}+{\epsilon}^{2})}=:{C}_{0}.$$  (13) 
Second, we observe that
$$\sqrt{{\alpha}_{i}\mathrm{\Delta}{R}_{i}}>\sqrt{\frac{{\alpha}_{\mathrm{min}}}{2}}=:{C}_{1}$$  (14) 
except in at most $N:=\frac{2}{{\alpha}_{\mathrm{min}}}({\mathrm{max}R}_{\mathrm{\Omega}}{\mathrm{min}R}_{\mathrm{\Omega}})$ teaching steps. To see that, note that if the claimed inequality, which is equivalent to ${\alpha}_{i}\frac{{\alpha}_{\mathrm{min}}}{2}>\mathrm{\Delta}{R}_{i}$, does not hold, performance increases by at least $\mathrm{\Delta}{R}_{i}\ge \frac{{\alpha}_{\mathrm{min}}}{2}$ as ${\alpha}_{i}>{\alpha}_{\mathrm{min}}$, and that can happen at most $N$ times.
The inequalities (13) and (14) together imply that we have
$${C}_{0}\cdot {C}_{1}\le \sqrt{({\alpha}_{i}\mathrm{\Delta}{R}_{i})\mathrm{\Delta}{R}_{i}}$$  (15) 
as long as ${\overline{R}}_{\U0001d5ab}R({\mu}_{r}^{\U0001d5ab,i})>\epsilon $, except in at most $N$ teaching steps. Setting $C:=\frac{1}{{C}_{0}\cdot {C}_{1}}$, this is equivalent to
$$\sqrt{\frac{\mathrm{\Delta}{R}_{i}}{{\alpha}_{i}\mathrm{\Delta}{R}_{i}}}\le C\mathrm{\Delta}{R}_{i}$$  (16) 
Plugging (16) into the bound (8) provided by Proposition 1, we obtain the estimate
$$\frac{1}{1+CD}({\overline{R}}_{\U0001d5ab}R({\mu}_{r}^{\U0001d5ab,i}))\le \mathrm{\Delta}{R}_{i}.$$  (17) 
We have $C=\frac{1}{\epsilon {\alpha}_{\mathrm{min}}}\sqrt{2(4{D}^{2}+{\epsilon}^{2})}$, and hence
$$\frac{1}{1+CD}=\frac{\epsilon {\alpha}_{\mathrm{min}}}{\epsilon {\alpha}_{\mathrm{min}}+\sqrt{2(4{D}^{2}+{\epsilon}^{2})}\cdot D}\ge \frac{1}{1+\sqrt{10}}\frac{\epsilon {\alpha}_{\mathrm{min}}}{{D}^{2}}=:\lambda $$  (18) 
If we had the estimates (17), (18) for all teaching steps, we could conclude that the learner performance satisfies $R({\mu}_{r}^{\U0001d5ab,i})>{\overline{R}}_{\U0001d5ab}2\epsilon $ after at most $O(\frac{{D}^{2}}{\epsilon {\alpha}_{\mathrm{min}}}\mathrm{log}\frac{D}{\epsilon})$ teaching steps. One can see that e.g. by comparing the sequence ${R}_{0},{R}_{1},{R}_{2},\mathrm{\dots}$ with the solution $R(t)$ of the ordinary differential equation $\dot{R}=\lambda ({\overline{R}}_{\U0001d5ab}R)$, which satifies ${\overline{R}}_{\U0001d5ab}R(t)=({\overline{R}}_{\U0001d5ab}R(0))\mathrm{exp}(\lambda t)$. Since the number $N$ of teaching steps for which (17), (18) do potentially not hold is $O(\frac{D}{{\alpha}_{\mathrm{min}}})$, we can still make this conclusion. ∎
Appendix D Background on (discounted) MCEIRL Problem (Section 3)
Our learner models build on the (discounted) Maximum Causal Entropy (MCE) IRL framework [ziebart2008maximum, ziebart2010modeling, ziebart2013principle, zhou2018mdce]. The results below are based on the MDCEIRL formulation from [zhou2018mdce].
D.1 Primal problem
In the standard (discounted) MCEIRL framework, a learning agent aims to identify a policy that matches the feature expectations of the teacher’s demonstrations while simultaneously maximizing the (discounted) causal entropy of the policy, i.e., the learner solves the following optimization problem:
$\underset{\pi}{\mathrm{max}}$  ${H}^{\gamma}({A}_{0:\mathrm{\infty}}\parallel {S}_{0:\mathrm{\infty}}):={\displaystyle \sum _{t=0}^{\mathrm{\infty}}}{\gamma}^{t}\mathbb{E}\left[\mathrm{log}\pi ({a}_{t}\mid {s}_{t})\right]$  
subject to  ${\mu}_{r}(\pi )[i]={\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})[i]\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}.$ 
Here, ${\mu}_{r}(\pi )[i]$ and ${\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})[i]$ denote the scalar values of the ${i}^{\text{th}}$ reward feature. The idea is that without any further information beyond the teacher’s demonstrations, the most uncertain solution matching the reward feature expectation of those demonstrations should be preferred.
Formulating this as a minimization problem and spelling out all the constraints, we arrive at the following primal:
$\underset{\bm{\pi}={\{{\pi}_{t}\}}_{t=0}^{\mathrm{\infty}}}{\mathrm{min}}{H}^{\gamma}({A}_{0:\mathrm{\infty}}\parallel {S}_{0:\mathrm{\infty}})$  
subject to  
${\mu}_{r}({\pi}_{t})[i]={\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})[i]\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
${\pi}_{t}(as)\ge 0\mathit{\hspace{1em}}\forall a\in \mathcal{A},s\in \mathcal{S},t\ge 0$  
$\sum _{a\in \mathcal{A}}}{\pi}_{t}(as)=1\mathit{\hspace{1em}}\forall s\in \mathcal{S},t\ge 0$  
${\pi}_{t}(as)={\pi}_{{t}^{\prime}}(as)\mathit{\hspace{1em}}\forall a\in \mathcal{A},s\in \mathcal{S},t\ge 0,{t}^{\prime}\ge 0$ 
The last condition ensures that the policy $\pi $ is stationary.
D.2 Lagrangian relaxation
The Lagrangian relaxation optimization formulation of the above primal problem is given by
$\mathcal{L}(\bm{\pi},\bm{\lambda},\bm{\psi})$  $={H}^{\gamma}({A}_{0:\mathrm{\infty}}\parallel {S}_{0:\mathrm{\infty}})+{\bm{\lambda}}^{\u2020}({\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3}){\mu}_{r}({\pi}_{t}))+{\displaystyle \sum _{s,t}}{\psi}_{s,t}(1{\displaystyle \sum _{a\in \mathcal{A}}}{\pi}_{t}(as))$  
subject to  
${\pi}_{t}(as)\ge 0\mathit{\hspace{1em}}\forall a\in \mathcal{A},s\in \mathcal{S},t\ge 0$  
${\pi}_{t}(as)={\pi}_{{t}^{\prime}}(as)\mathit{\hspace{1em}}\forall a\in \mathcal{A},s\in \mathcal{S},t,{t}^{\prime}\ge 0$ 
Here, $\bm{\lambda}\in {\mathbb{R}}^{{d}_{r}}$ and $\bm{\psi}={\{{\psi}_{s,t}\}}_{\forall {s}_{t}}$. Also, $\u2020$ is the transpose operator defined for vectors.
Remark. The Lagrangian relaxation of the optimization problem is not convex in the problem variables because of the term ${\bm{\lambda}}^{\u2020}({\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3}){\mu}_{r}({\pi}_{t}))$ in the objective function, which is not convex in the variables ${\pi}_{t}$. However, it can be shown that strong duality holds for both its dual and primal formulations ([zhou2018mdce]). The dual formulation is described in Section D.4.
D.3 Parametric form of the policy
For a given $\bm{\lambda}$, the optimal policy ${\pi}_{\bm{\lambda}}^{\text{soft}}(as)$ is given by
${\pi}_{\bm{\lambda}}^{\text{soft}}(as)$  $={\displaystyle \frac{\mathrm{exp}({Q}_{\bm{\lambda}}^{\text{soft}}(s,a))}{\mathrm{exp}({V}_{\bm{\lambda}}^{\text{soft}}(s))}}$ 
where the quantities are defined recursively as follows:
${Q}_{\bm{\lambda}}^{\text{soft}}(s,a)$  $={\bm{\lambda}}^{\u2020}{\mu}_{r}({\pi}_{\bm{\lambda}}^{\text{soft}}(as))+\gamma {\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}T({s}^{\prime}s,a){V}_{\bm{\lambda}}^{\text{soft}}({s}^{\prime})$  
${V}_{\bm{\lambda}}^{\text{soft}}(s)$  $=\text{log}{\displaystyle \sum _{a\in \mathcal{A}}}\mathrm{exp}({Q}_{\bm{\lambda}}^{\text{soft}}(s,a))$ 
This is shown by taking the derivative of the Lagrangian, $\mathcal{L}(\bm{\pi},\bm{\lambda},\bm{\psi})$ w.r.t. the primal variables ${\pi}_{t}$ and equating it to 0, i.e.,
$\frac{\partial L({\{{\pi}_{t}\}}_{t=0}^{\mathrm{\infty}},\bm{\lambda},\bm{\psi})}{\partial {\pi}_{t}}$  $=0.$ 
For a given $\bm{\lambda}$, the corresponding softmax policy can be obtained by SoftValueIteration procedure (see [ziebart2010modeling, Algorithm. 9.1], [zhou2018mdce]).
D.4 Dual problem
For any given $\bm{\lambda},\bm{\psi}$, let $g(\bm{\lambda},\bm{\psi})$ be the optimal value for the optimization problem defined by the Lagrangian relaxation problem in Section D.2. As strong duality holds for the (discounted) MCEIRL problem and its dual counter part, we solve only the following concave dual problem:
$\underset{\bm{\lambda}\in {\mathbb{R}}^{{d}_{r}},{\psi}_{s,t}\in \mathbb{R}}{\mathrm{maximize}}\mathit{\hspace{1em}}g(\bm{\lambda},\bm{\psi})$ 
D.5 Gradients for the dual variables
As the dual problem is concave, it can be solved using gradient ascent. The gradients of the dual function described in Section D.4 are given by:
${\nabla}_{\bm{\lambda}}\mathit{\hspace{1em}}g$  $={\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3}){\mu}_{r}({\pi}_{\bm{\lambda}}^{\text{soft}})$  
${\nabla}_{{\psi}_{s,t}}\mathit{\hspace{1em}}g$  $=1{\displaystyle \sum _{a\in \mathcal{A}}}{\pi}_{\bm{\lambda}}^{\text{soft}}(as)$ 
Here ${\pi}_{\bm{\lambda}}^{\text{soft}}$ is the parametric softmax policy described above. The second condition is automatically satisfied because ${\pi}_{\bm{\lambda}}^{\text{soft}}$ is a probability distribution.
The gradient update rule to compute the optimal $\bm{\lambda}$ is:
${\bm{\lambda}}_{\text{next}}\leftarrow \bm{\lambda}\eta \cdot \left({\mu}_{r}({\pi}_{\bm{\lambda}}^{\text{soft}}){\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})\right)$ 
where $\eta $ is the learning rate.
Appendix E Details of (discounted) MCEIRL Problem with Preferences (Section 3.2)
Here we present the background of the learner model described in Section 3.2. In this setting, the learner’s preferences are modeled as linear soft constraints with L1 penalties. We consider the minimization variant of the problem. The results in this section follow directly from the analysis of Maximum Entropy Models under different constraints, as presented in [tsujii2005maxent, schap2007maxent] when applied to (discounted) MCEIRL problem [ziebart2013principle, zhou2018mdce]. For brevity, redundant details of the derivations are omitted.
The final policy of the learner is given by ${\pi}_{\bm{\lambda}}^{\text{soft}}$ and is defined in Section E.3.
E.1 Primal problem
The primal problem is given by
$\underset{\bm{\pi}={\{{\pi}_{t}\}}_{t=0}^{\mathrm{\infty}};{\delta}_{r}^{\text{soft},\text{low}},{\delta}_{r}^{\text{soft},\text{up}},{\delta}_{c}^{\text{soft},\text{up}}\ge 0}{\mathrm{min}}{H}^{\gamma}({A}_{0:\mathrm{\infty}}{S}_{0:\mathrm{\infty}})+{\displaystyle \sum _{i=1}^{{d}_{r}}}{C}_{r}\cdot ({\delta}_{r}^{\text{soft},\text{low}}[i]+{\delta}_{r}^{\text{soft},\text{up}}[i])+{\displaystyle \sum _{j=1}^{{d}_{c}}}{C}_{c}\cdot {\delta}_{c}^{\text{soft},\text{up}}[j]$  
subject to  
$\mathrm{\hspace{1em}\hspace{1em}}{\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})[i]{\mu}_{r}({\pi}_{t})[i]\le {\delta}_{r}^{\text{soft},\text{low}}[i]\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
$\mathrm{\hspace{1em}\hspace{1em}}{\mu}_{r}({\pi}_{t})[i]{\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})[i]\le {\delta}_{r}^{\text{soft},\text{up}}[i]\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
$\mathrm{\hspace{1em}\hspace{1em}}{\mu}_{c}({\pi}_{t})[j]\le {\delta}_{c}^{\text{hard}}[j]+{\delta}_{c}^{\text{soft},\text{up}}[j]\mathit{\hspace{1em}}\forall j\in \{1,2,\mathrm{\dots},{d}_{c}\}$ 
Here we have ${\delta}_{r}^{\text{soft},\text{low}},{\delta}_{r}^{\text{soft},\text{up}}\in {\mathbb{R}}^{{d}_{r}}$ and ${\delta}_{c}^{\text{soft},\text{up}}\in {\mathbb{R}}^{{d}_{c}}$ as the primal optimization slack variables with the constraint that ${\delta}_{r}^{\text{soft},\text{low}},{\delta}_{r}^{\text{soft},\text{up}},{\delta}_{c}^{\text{soft},\text{up}}\ge 0$. We also have ${C}_{r}>0,{C}_{c}>0$. ${\delta}_{c}^{\text{hard}}\in {\mathbb{R}}^{{d}_{c}}$ is a given constant vector.
Remark. low and up in the superscripts of dual variables represent whether they are variables for lower bound constraints or upper bound constraints.
E.2 Lagrangian relaxation
The Lagrangian relaxation optimization formulation of the primal problem described in Section E.1 is given by
$\mathcal{L}(\bm{\pi},{\delta}_{r}^{\text{soft},\text{low}},{\delta}_{r}^{\text{soft},\text{up}},{\delta}_{c}^{\text{soft},\text{up}},\bm{\lambda},\bm{\psi})$  $={H}^{\gamma}({A}_{0:\mathrm{\infty}},{S}_{0:\mathrm{\infty}})+{({\bm{\alpha}}^{\text{low}}{\bm{\alpha}}^{\text{up}})}^{\u2020}({\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3}){\mu}_{r}({\pi}_{t}))$  
$+{\bm{\beta}}^{\u2020}{\mu}_{c}({\pi}_{t})$  
$+{\displaystyle \sum _{s,t}}{\psi}_{s,t}(1{\displaystyle \sum _{a\in \mathcal{A}}}{\pi}_{t}(as)){({\bm{\alpha}}^{\text{low}})}^{\u2020}{\delta}_{r}^{\text{soft},\text{low}}{({\bm{\alpha}}^{\text{up}})}^{\u2020}{\delta}_{r}^{\text{soft},\text{up}}$  
${\bm{\beta}}^{\u2020}{\delta}_{c}^{\text{soft},\text{up}}{\bm{\beta}}^{\u2020}{\delta}_{c}^{\text{hard}}$  
${({\bm{\rho}}^{\text{low}})}^{\u2020}{\delta}_{r}^{\text{soft},\text{low}}{({\bm{\rho}}^{\text{up}})}^{\u2020}{\delta}_{r}^{\text{soft},\text{up}}$  
${\bm{\sigma}}^{\u2020}{\delta}_{c}^{\text{soft},\text{up}}$  
$+{\displaystyle \sum _{i=1}^{{d}_{r}}}{C}_{r}\cdot ({\delta}_{r}^{\text{soft},\text{low}}[i]+{\delta}_{r}^{\text{soft},\text{up}}[i])+{\displaystyle \sum _{j=1}^{{d}_{c}}}{C}_{c}\cdot {\delta}_{c}^{\text{soft},\text{up}}[j]$  
subject to  
${\pi}_{t}(as)\ge 0\mathit{\hspace{1em}}\forall a\in \mathcal{A},s\in \mathcal{S},t\ge 0$  
${\pi}_{t}(as)={\pi}_{{t}^{{}^{\prime}}}(as)\mathit{\hspace{1em}}\forall a\in \mathcal{A},s\in \mathcal{S},t,{t}^{{}^{\prime}}\ge 0$ 
Here, ${\bm{\alpha}}^{\text{low}},{\bm{\alpha}}^{\text{up}},{\bm{\rho}}^{low},{\bm{\rho}}^{up}\in {\mathbb{R}}^{{d}_{r}}$, and $\bm{\beta},\bm{\sigma}\in {\mathbb{R}}^{{d}_{c}}$. We also have nonnegativity constraints on the dual variables: ${\bm{\alpha}}^{\text{low}},{\bm{\alpha}}^{\text{up}},\bm{\beta},{\bm{\rho}}^{\text{low}},{\bm{\rho}}^{\text{up}},\bm{\sigma}\ge 0$. A few additional notes:

•
For convenience, we will denote the group of dual variables as $\bm{\lambda}:=\{{\bm{\alpha}}^{\text{low}},{\bm{\alpha}}^{\text{up}},\bm{\beta},{\bm{\rho}}^{\text{low}},{\bm{\rho}}^{\text{up}},\bm{\sigma}\}$

•
The reward parameter ${\bm{w}}_{\bm{\lambda}}={[{({\bm{\alpha}}^{\text{low}}{\bm{\alpha}}^{\text{up}})}^{\u2020},{\bm{\beta}}^{\u2020}]}^{\u2020}$ is used to define the learner’s reward function ${R}_{\bm{\lambda}}(s)=\u27e8{\bm{w}}_{\bm{\lambda}},\varphi (s)\u27e9$.

•
$\u2020$ is the transpose operator, defined for vectors.
E.3 Parametric form of the policy
For a given, $\bm{\lambda}:=\{{\bm{\alpha}}^{\text{low}},{\bm{\alpha}}^{\text{up}},\bm{\beta},{\bm{\rho}}^{\text{low}},{\bm{\rho}}^{\text{up}},\bm{\sigma}\}$, the optimal policy ${\pi}_{\bm{\lambda}}^{\text{soft}}(as)$ is given by
${\pi}_{\bm{\lambda}}^{\text{soft}}(as)$  $={\displaystyle \frac{\mathrm{exp}({Q}_{\bm{\lambda}}^{\text{soft}}(s,a))}{\mathrm{exp}({V}_{\bm{\lambda}}^{\text{soft}}(s))}}$ 
where the quantities are defined recursively as follows:
${Q}_{\bm{\lambda}}^{\text{soft}}(s,a)$  $={({\bm{\alpha}}_{low}{\bm{\alpha}}_{up})}^{\u2020}{\mu}_{r}({\pi}_{\bm{\lambda}}^{\text{soft}}(as)){\bm{\beta}}^{\u2020}{\mu}_{c}({\pi}_{\bm{\lambda}}^{\text{soft}}(as))+\gamma {\displaystyle \sum _{{s}^{{}^{\prime}}\in \mathcal{S}}}T({s}^{{}^{\prime}}s,a){V}_{\bm{\lambda}}^{\text{soft}}({s}^{{}^{\prime}})$  
${V}_{\bm{\lambda}}^{\text{soft}}(s)$  $=\text{log}({\displaystyle \sum _{a\in \mathcal{A}}}\mathrm{exp}({Q}_{\bm{\lambda}}^{\text{soft}}(s,a)))$ 
This is shown by taking the derivative of the Lagrangian, $\mathcal{L}(\bm{\pi},\bm{\lambda},\bm{\psi})$ w.r.t the primal variables, ${\pi}_{t}$ and equating it to 0. i.e.
$\frac{\partial L({\{{\pi}_{t}\}}_{t=0}^{\mathrm{\infty}},\bm{\lambda},\bm{\psi})}{\partial {\pi}_{t}}$  $=0$ 
E.4 Updated Lagrangian
We find the partial derivatives of the Lagrangian defined in Section E.2 w.r.t all the primal variables, ${\delta}_{r}^{\text{soft},\text{low}},{\delta}_{r}^{\text{soft},\text{up}},{\delta}_{c}^{\text{soft},\text{up}}$:
$\frac{\partial \mathcal{L}}{\partial {\delta}_{r}^{\text{soft},\text{low}}[i]}$  $=0$  
$\Rightarrow {\alpha}^{\text{low}}[i]$  $={C}_{r}{\rho}^{\text{low}}[i]$  
$\text{Also,}{\displaystyle \frac{\partial \mathcal{L}}{\partial {\delta}_{r}^{\text{soft},\text{up}}}}$  $=0$  
$\Rightarrow {\alpha}^{\text{up}}[i]$  $={C}_{r}{\rho}^{\text{up}}[i]$  
$\text{And,}{\displaystyle \frac{\partial \mathcal{L}}{\partial {\delta}_{c}^{\text{soft},\text{up}}}}$  $=0$  
$\Rightarrow \beta [i]$  $={C}_{r}\sigma [i]$ 
The dual variables satisfy $\bm{\sigma},{\bm{\rho}}^{\text{low}},{\bm{\rho}}^{\text{up}}\ge 0$. Hence, the above conditions translate into the following constraints on the set of dual variables, ${\bm{\alpha}}^{\text{low}},{\bm{\alpha}}^{\text{up}},\bm{\beta}$:
$0\le {\alpha}^{\text{low}}[i]\le {C}_{r}\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
$0\le {\alpha}^{\text{up}}[i]\le {C}_{r}\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
$0\le \beta [j]\le {C}_{c}\mathit{\hspace{1em}}\forall j\in \{1,2,\mathrm{\dots},{d}_{c}\}$ 
The updated Lagrangian now has these additional constraints and is given by:
$\mathcal{L}(\bm{\pi},{\delta}_{r}^{\text{soft},\text{low}},{\delta}_{r}^{\text{soft},\text{up}},{\delta}_{c}^{\text{soft},\text{up}},\bm{\lambda},\bm{\psi})$  $={H}^{\gamma}({A}_{0:\mathrm{\infty}},{S}_{0:\mathrm{\infty}})+{({\bm{\alpha}}^{\text{low}}{\bm{\alpha}}^{\text{up}})}^{\u2020}({\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3}){\mu}_{r}({\pi}_{t}))+{\bm{\beta}}^{\u2020}{\mu}_{c}({\pi}_{t})$  
$+{\displaystyle \sum _{s,t}}{\psi}_{s,t}(1{\displaystyle \sum _{a\in \mathcal{A}}}{\pi}_{t}(as)){({\bm{\alpha}}^{\text{low}})}^{\u2020}{\delta}_{r}^{\text{soft},\text{low}}{({\bm{\alpha}}^{up})}^{\u2020}{\delta}_{r}^{\text{soft},\text{up}}$  
${\bm{\beta}}^{\u2020}{\delta}_{c}^{\text{soft},\text{up}}{\bm{\beta}}^{\u2020}{\delta}_{c}^{\text{hard}}$  
${({\bm{\rho}}^{\text{low}})}^{\u2020}{\delta}_{r}^{\text{soft},\text{low}}{({\bm{\rho}}^{\text{up}})}^{\u2020}{\delta}_{r}^{\text{soft},\text{up}}$  
${\bm{\sigma}}^{\u2020}{\delta}_{c}^{\text{soft},\text{up}}$  
$+{\displaystyle \sum _{i=1}^{{d}_{r}}}{C}_{r}\cdot ({\delta}_{r}^{\text{soft},\text{low}}[i]+{\delta}_{r}^{\text{soft},\text{up}}[i])+{\displaystyle \sum _{j=1}^{{d}_{c}}}{C}_{c}\cdot {\delta}_{c}^{\text{soft},\text{up}}[j]$  
subject to  
${\pi}_{t}(as)\ge 0\mathit{\hspace{1em}}\forall a\in \mathcal{A},s\in \mathcal{S},t\ge 0$  
${\pi}_{t}(as)={\pi}_{{t}^{{}^{\prime}}}(as)\mathit{\hspace{1em}}\forall a\in \mathcal{A},s\in \mathcal{S},t,{t}^{{}^{\prime}}\ge 0$  
$0\le {\alpha}^{\text{low}}[i]\le {C}_{r}\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
$0\le {\alpha}^{\text{up}}[i]\le {C}_{r}\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
$0\le \beta [j]\le {C}_{c}\mathit{\hspace{1em}}\forall j\in \{1,2,\mathrm{\dots},{d}_{c}\}$ 
The set of dual variables becomes $\bm{\lambda}:=\{{\bm{\alpha}}^{\text{low}},{\bm{\alpha}}^{\text{up}},\bm{\beta}\}$ and $\bm{\psi}={\{{\psi}_{s,t}\}}_{\forall {s}_{t}}$.
E.5 Dual problem
For any given $\bm{\lambda}\mathbf{,}\bm{\psi}$, let $g(\bm{\lambda}\mathbf{,}\bm{\psi})$ be the optimal value for the Lagrangian relaxation problem. Strong Duality holds for both our primal and dual formulations, and the dual optimal policy is also optimal for the primal formulation. Hence, we solve the concave dual problem, given by
$\underset{{\bm{\alpha}}^{\text{low}},{\bm{\alpha}}^{\text{up}}\in {\mathbb{R}}^{{d}_{r}},\bm{\beta}\in {\mathbb{R}}^{{d}_{c}},{\psi}_{s,t}\in \mathbb{R}}{\text{maximize}}\text{}g(\bm{\lambda},\bm{\psi})$  
subject to  
$\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{low}}\le {C}_{r}$  
$\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{up}}\le {C}_{r}$  
$\mathrm{\hspace{1em}}0\le \bm{\beta}\le {C}_{c}$ 
where $\bm{\lambda}:=\{{\bm{\alpha}}^{\text{low}},{\bm{\alpha}}^{\text{up}},\bm{\beta}\}$.
E.6 Gradients for the dual problem
As the dual problem is concave, it can be solved using gradient ascent.
Note that,
${\nabla}_{{\psi}_{s,t}}g$  $=1{\displaystyle \sum _{a\in \mathcal{A}}}{\pi}_{\bm{\lambda}}^{\text{soft}}(as)$ 
Here ${\pi}_{\bm{\lambda}}^{\text{soft}}$ is the parametric softmax policy described above. This condition is automatically satisfied because ${\pi}_{\bm{\lambda}}^{\text{soft}}$ is a probability distribution. For the remaining dual variables, we have the following gradients:
${\nabla}_{{\bm{\alpha}}^{\text{low}}}\text{}g$  $={\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3}){\mu}_{r}({\pi}_{\bm{\lambda}}^{\text{soft}})$  
${\nabla}_{{\bm{\alpha}}^{\text{up}}}\mathit{\hspace{1em}}g$  $={\mu}_{r}({\pi}_{\bm{\lambda}}^{\text{soft}}){\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3})$  
${\nabla}_{\bm{\beta}}\text{}g$  $={\mu}_{c}({\pi}_{\bm{\lambda}}^{\text{soft}})$ 
The (projected) gradient update rules to compute the optimal value of the dual variables $({\bm{\alpha}}^{\text{low}},{\bm{\alpha}}^{\text{up}},\bm{\beta})$ are given by the following:
${\bm{\alpha}}_{\text{next}}^{\text{low}}$  $\leftarrow {\bm{\alpha}}^{\text{low}}\eta \cdot ({\mu}_{r}({\pi}_{\bm{\lambda}}^{\text{soft}}){\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3}))$  
${\alpha}_{\text{next}}^{\text{low}}[i]$  $\leftarrow \mathrm{max}(0,{\alpha}_{\text{next}}^{\text{low}}[i])\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
${\alpha}_{\text{next}}^{\text{low}}[i]$  $\leftarrow \mathrm{min}({C}_{r},{\alpha}_{\text{next}}^{\text{low}}[i])\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$ 
${\bm{\alpha}}_{\text{next}}^{\text{up}}$  $\leftarrow {\bm{\alpha}}^{\text{up}}\eta \cdot ({\widehat{\mu}}_{r}({\mathrm{\Xi}}^{\U0001d5b3}){\mu}_{r}({\pi}_{\bm{\lambda}}^{\text{soft}}))$  
${\alpha}_{\text{next}}^{\text{up}}[i]$  $\leftarrow \mathrm{max}(0,{\alpha}_{\text{next}}^{\text{up}}[i])\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$  
${\alpha}_{\text{next}}^{\text{up}}[i]$  $\leftarrow \mathrm{min}({C}_{r},{\alpha}_{\text{next}}^{\text{up}}[i])\mathit{\hspace{1em}}\forall i\in \{1,2,\mathrm{\dots},{d}_{r}\}$ 
${\bm{\beta}}_{\text{next}}$  $\leftarrow \bm{\beta}\eta \cdot ({\mu}_{c}({\pi}_{\bm{\lambda}}^{\text{soft}}))$  
${\beta}_{\text{next}}[j]$  $\leftarrow \mathrm{max}(0,{\beta}_{\text{next}}[j])\mathit{\hspace{1em}}\forall j\in \{1,2,\mathrm{\dots},{d}_{c}\}$  
${\beta}_{\text{next}[j]}$  $\leftarrow \mathrm{min}({C}_{c},{\beta}_{\text{next}}[j])\mathit{\hspace{1em}}\forall j\in \{1,2,\mathrm{\dots},{d}_{c}\}$ 
where $\eta $ is the learning rate.
Appendix F LP Formulation for the Teacher AwareCMDP (Section 4.1)
The problem of finding optimal learneraware teaching demonstrations for the learner in Section 3.1 with linear preferences can be formulated as the following linear program (based on the linear programming formulation for solving MDPs [de1960problemes]):
$\underset{z}{\mathrm{max}}$  $\sum _{s}}{\displaystyle \sum _{a}}z(s,a)\u27e8{\mathbf{w}}_{r}^{*},{\varphi}_{r}(s)\u27e9$  (19)  
s.t.  $\sum _{a}}z({s}^{\prime},a)=(1\gamma ){P}_{0}({s}^{\prime})+\gamma {\displaystyle \sum _{s}}{\displaystyle \sum _{a}}T({s}^{\prime}s,a)z(s,a)\mathit{\hspace{1em}}\forall {s}^{\prime$  (20)  
$z(s,a)\ge 0\mathit{\hspace{1em}}\forall s,a$  (21)  
$\sum _{s}}{\displaystyle \sum _{a}}z(s,a){\varphi}_{c}(s)[j]\le {\delta}_{c}^{\text{hard}}[j]\mathit{\hspace{1em}}\forall j\in \{1,2,\mathrm{\dots},{d}_{c}\$  (22) 
Here $z$ is a vector of discounted stateaction frequencies and $z(s,a)$ refers to stateaction frequency for state $s$ and action $a$. The constraints in (22) are the linear preference constraints. From the optimal solution of the LP, an optimal stochastic policy can be extracted by
$\pi (s,a):={\displaystyle \frac{z(s,a)}{{\sum}_{{a}^{\prime}}z(s,{a}^{\prime})}}.$  (23) 
Appendix G BiLevel Optimization Approach (Section 4.2)
We only show the formalism for the most general bilevel problem for learners with linear preferences.
G.1 Using Dual (discounted) MCEIRL formulation for the learner model in Section 3.2
The basic bilevel optimization problem that we aim to solve is the following:
$\underset{{\pi}^{\U0001d5b3}}{\mathrm{max}}$  $\mathrm{\hspace{1em}}R({\pi}^{\U0001d5ab})$  
subject to  $\mathrm{\hspace{1em}}{\pi}^{\U0001d5ab}\in \mathrm{arg}\underset{\pi}{\mathrm{max}}\text{IRL}(\pi ,\mu ({\pi}^{\U0001d5b3})).$ 
We will replace the lowerlevel problem, i.e., $\mathrm{arg}{\mathrm{max}}_{\pi}\text{IRL}(\pi ,\mu ({\pi}^{\U0001d5b3}))$ with its KarushKuhnTucker conditions [boyd2004convex, sinha2018review]. The lowerlevel problem in its dual formulation is given in Appendix E.5.
Omitting details and replacing $R({\pi}_{\bm{\lambda}}):=\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}({\pi}_{\bm{\lambda}})\u27e9$, this yields problems of the following form:
$\underset{\bm{\lambda}}{\mathrm{max}}$  $\mathrm{\hspace{1em}}\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}({\pi}_{\bm{\lambda}})\u27e9$  
subject to:  
$\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{low}}\le {C}_{r}$  
$\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{up}}\le {C}_{r}$  
$\mathrm{\hspace{1em}}0\le \bm{\beta}\le {C}_{c}$  
$\mathrm{\hspace{1em}}{\mu}_{c}({\pi}_{\bm{\lambda}})\le (\ge ){\delta}_{c}^{\text{hard}}$ 
where $\bm{\lambda}:=\{{\bm{\alpha}}^{\text{low}},{\bm{\alpha}}^{\text{up}},\bm{\beta}\}$. Here ${\pi}_{\bm{\lambda}}$ corresponds to a softmax policy with a reward function ${R}_{\bm{\lambda}}(s)=\u27e8{\bm{w}}_{\bm{\lambda}},\varphi (s)\u27e9$ for ${\bm{w}}_{\bm{\lambda}}={[{({\bm{\alpha}}^{\text{low}}{\bm{\alpha}}^{\text{up}})}^{\u2020},{\bm{\beta}}^{\u2020}]}^{\u2020}$. Thus, finding optimal demonstrations means optimization over softmax teaching policies while respecting the learner’s preferences.
G.1.1 Optimal solution
The cases of the above problem we can observe have to be solved separately and the best solution must be picked. That is, we
find the following two solutions: (step i) ${\bm{\lambda}}_{1}^{*}$, and (step ii) ${\bm{\lambda}}_{2}^{*}$. Then pick the best ${\bm{\lambda}}^{*}$ in (step iii):
Step i: ${\lambda}_{\mathrm{1}}^{\mathrm{*}}$
Compute optimal parameters ${\bm{\lambda}}_{1}^{*}$ by solving the following problem:
$\underset{\lambda}{\mathrm{max}}$  $\mathrm{\hspace{1em}}\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}({\pi}_{\bm{\lambda}})\u27e9$  
subject to:  
$\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{low}}\le {C}_{r}$  
$\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{up}}\le {C}_{r}$  
$\mathrm{\hspace{1em}}0\le \bm{\beta}\le {C}_{c}$  
$\mathrm{\hspace{1em}}{\mu}_{c}({\pi}_{\bm{\lambda}})\le {\delta}_{c}^{\text{hard}}$ 
Step ii: ${\lambda}_{\mathrm{2}}^{\mathrm{*}}$ Compute optimal parameters ${\bm{\lambda}}_{2}^{*}$ by solving the following problem:
$\underset{\bm{\lambda}}{\mathrm{max}}$  $\mathrm{\hspace{1em}}\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}({\pi}_{\bm{\lambda}})\u27e9$  (24)  
subject to:  (25)  
$\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{low}}\le {C}_{r}$  (26)  
$\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{up}}\le {C}_{r}$  (27)  
$\mathrm{\hspace{1em}}\bm{\beta}={C}_{c}$  (28)  
$\mathrm{\hspace{1em}}{\mu}_{c}({\pi}_{\bm{\lambda}})\ge {\delta}_{c}^{\text{hard}}$  (29) 
Step iii: ${\lambda}^{\mathrm{*}}$ Pick the best solution as
${\bm{\lambda}}^{*}=\mathrm{arg}\underset{\bm{\lambda}\in \{{\bm{\lambda}}_{1}^{*},{\bm{\lambda}}_{2}^{*}\}}{\mathrm{max}}\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}({\pi}_{\bm{\lambda}})\u27e9$ 
This provides the optimal policy for the teacher. The teacher then computes feature expectation of this policy and provide it to the learner.
G.2 Solving the above problem
We adopt a variant of the FrankWolfe algorithm [jaggi2013] to solve the problems of the form:
$\underset{\bm{\lambda}}{\mathrm{max}}$  $\mathrm{\hspace{1em}}R({\pi}_{\bm{\lambda}}):=\u27e8{\mathbf{w}}_{r}^{*},{\mu}_{r}({\pi}_{\bm{\lambda}})\u27e9$  (30)  
subject to:  (31)  
$\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{low}}\le {C}_{r}$  (32)  
$\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{up}}\le {C}_{r}$  (33)  
$\mathrm{\hspace{1em}}0\le \bm{\beta}\le {C}_{c}$  (34)  
$\mathrm{\hspace{1em}}{\mu}_{c}({\pi}_{\bm{\lambda}})\le (\ge ){\delta}_{c}^{\text{hard}}$  (35) 
In particular, we take the following steps to optimize the teaching policy ${\pi}_{\bm{\lambda}}$:

1.
Initialization. Find a feasible starting point ${\bm{\lambda}}_{0}$

2.
Optimization. For $t=1,2,\mathrm{\dots}$

•
Compute the gradient ${\bm{g}}_{t}=[{\nabla}_{\bm{\lambda}}R({\pi}_{\bm{\lambda}})]({\bm{\lambda}}_{t1})$ of the objective at ${\bm{\lambda}}_{t1}$. In experiments we approximate the gradient using finitedifferences.

•
Linearize the constraints ${\mu}_{c}({\pi}_{\bm{\lambda}})\le (\ge ){\delta}_{c}^{\text{hard}}$ at ${\bm{\lambda}}_{t1}$ as ${\bm{b}}_{t}+{\bm{A}}_{t}(\bm{\lambda}{\bm{\lambda}}_{t1})\le (\ge ){\delta}_{c}^{\text{hard}}$, where ${\bm{b}}_{t}={\mu}_{c}({\pi}_{{\bm{\lambda}}_{t1}})$ and ${\bm{A}}_{t}=[{\nabla}_{\bm{\lambda}}{\mu}_{c}({\pi}_{\bm{\lambda}})]({\bm{\lambda}}_{t1})$. Again, we employ finitedifferences to approximate this linearization. Clearly, we can reuse computation from the gradient estimation of the objective here to reduce computational demands.

•
Solve the directionfinding subproblem (a linear problem):
$\underset{\bm{\gamma}}{\mathrm{max}}$ $\mathrm{\hspace{1em}}\u27e8\bm{\gamma},{\bm{g}}_{t}\u27e9$ subject to: $\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{low}}\le {C}_{r}$ $\mathrm{\hspace{1em}}0\le {\bm{\alpha}}^{\text{up}}\le {C}_{r}$ $\mathrm{\hspace{1em}}0\le \bm{\beta}\le {C}_{c}$ $\mathrm{\hspace{1em}}{\bm{b}}_{t}+{\bm{A}}_{t1}(\bm{\lambda}{\bm{\lambda}}_{t1})\le (\ge ){\delta}_{c}^{\text{hard}}$ with optimal solution ${\bm{\gamma}}_{t}^{*}$. Assuming that the linear approximation of the constraints is accurate locally, the directional vector ${\bm{d}}_{t}={\bm{\gamma}}_{t}^{*}{\bm{\lambda}}_{t1}$ is an ascent direction.

•
Perform a linesearch from ${\bm{\lambda}}_{t1}$ to ${\bm{\gamma}}_{t}^{*}$ and let ${\bm{\lambda}}_{t}$ be the point that maximizes the line search.

•
Upon convergence, terminate the For loop.

•
Upon convergence of the algorithm, the teacher can use the final ${\bm{\lambda}}_{t}$ for teaching.
Remark. Observe that the above algorithm would reduce to the standard FrankWolfe algorithm with linesearch in the case of linear inequalities only.