On the Correctness and Sample Complexity of Inverse Reinforcement Learning

  • 2019-06-02 15:22:42
  • Abi Komanduru, Jean Honorio
  • 1

Abstract

Inverse reinforcement learning (IRL) is the problem of finding a rewardfunction that generates a given optimal policy for a given Markov DecisionProcess. This paper looks at an algorithmic-independent geometric analysis ofthe IRL problem with finite states and actions. A L1-regularized Support VectorMachine formulation of the IRL problem motivated by the geometric analysis isthen proposed with the basic objective of the inverse reinforcement problem inmind: to find a reward function that generates a specified optimal policy. Thepaper further analyzes the proposed formulation of inverse reinforcementlearning with $n$ states and $k$ actions, and shows a sample complexity of$O(n^2 \log (nk))$ for recovering a reward function that generates a policythat satisfies Bellman's optimality condition with respect to the truetransition probabilities.

 

Quick Read (beta)

On the Correctness and Sample Complexity of Inverse Reinforcement Learning

Abi Komanduru Purdue University, West Lafayette IN 47906 Jean Honorio Purdue University, West Lafayette IN 47906
Abstract

Inverse reinforcement learning (IRL) is the problem of finding a reward function that generates a given optimal policy for a given Markov Decision Process. This paper looks at an algorithmic-independent geometric analysis of the IRL problem with finite states and actions. A L1-regularized Support Vector Machine formulation of the IRL problem motivated by the geometric analysis is then proposed with the basic objective of the inverse reinforcement problem in mind: to find a reward function that generates a specified optimal policy. The paper further analyzes the proposed formulation of inverse reinforcement learning with n states and k actions, and shows a sample complexity of O(n2log(nk)) for recovering a reward function that generates a policy that satisfies Bellman’s optimality condition with respect to the true transition probabilities.

1 Introduction

Reinforcement Learning is the process of generating an optimal policy for a given Markov Decision Process along with a reward function. Often, in situations including apprenticeship learning, the reward function is unknown but optimal policy can be observed through the actions of an expert. In cases such as these, it is desirable to learn a reward function generating the observed optimal policy. This problem is referred to as Inverse Reinforcement Learning (IRL) [5]. It is well known that such a reward function is not necessarily unique. Various algorithms to solve the IRL problem have been made including linear programming [5] and Bayesian estimation [6]. [2] looked at using IRL to solve the apprenticeship learning problem by trying to find a reward function that maximizes the margin of the expert’s policy. The goal of the problem presented in [2] is to find a policy that comes close in value to the expert’s policy for some unspecified true reward function. However none of the prior works provide a formal guarantee that the reward function obtained from the empirical data is optimal for the true transition probabilities in inverse reinforcement learning.

This paper looks at formulating the IRL problem by using the basic objective of inverse reinforcement: to find a reward function that generates a specified optimal policy. The paper also looks at establishing a sample complexity to meet this basic goal when the transition probabilities are estimated from observed trajectories. To achieve this, an algorithmic-independent geometric analysis of the IRL problem with finite states and actions as presented in [5] is provided. A L1-regularized Support Vector Machine (SVM) formulation of the IRL problem motivated by the geometric analysis is then proposed. Theoretical analysis of the sample complexity of the L1 SVM formulation is then performed. Finally, experimental results comparing the L1 SVM formulation to the linear programming formulation presented in [5] are presented showing the improved performance of the L1 SVM formulation with respect to the basic objective, i.e Bellman optimality with respect to the true transition probabilities. To the best of our knowledge, we are the first to provide an algorithm with formal guarantees for inverse reinforcement learning.

2 Preliminaries

The formulation of the IRL problem is based on a Markov Decision Process (MDP) (S,A,{Psa},γ,R), where  

  • S is a finite set of n states.

  • A={a1,,ak} is a set of k actions.

  • Pa[0,1]n×n are the state transition probabilities for action a. We use Pa(s)[0,1]n and Psa[0,1]n to represent the state transition probabilities for action a in state s and Pa(i,j)[0,1] to represent the probability of going from state i to state j when taking action a.

  • γ[0,1] is the discount factor.

  • R:S is the reinforcement or reward function.

It is important to note that the state transition probability matrices are right stochastic. Mathematically this can be stated as Pa(i,j)0i,jandjPa(i,j)=1i

In the subsequent sections, we use the notation

Fai:=(Pa1(i)-Pa(i))(I-γPa1)-1

The empirical maximum likelihood estimates of the transition probabilities from sampled trajectories are denoted by P^a(i) and in a similar fashion we use the notation

F^ai:=(P^a1(i)-P^a(i))(I-γP^a1)-1

The following norms are used throughout this paper. The infinity norm of a matrix A=[aij] is defined as A=supi,j|aij|. The L1 norm of a vector is defined as b1=i|bi|. The induced matrix norm is defined as |||A|||=supjaj1 where aj is the j-th row of the matrix A. Note that for a right stochastic matrix P, we can see that |||P|||=1 and P1.

In this paper the reward function is assumed to be a function of purely the state instead of the state and the action. This assumption is also made for the initial results in [5]. A policy is defined as a map π:SA. Given a policy π, we can define two functions.
 
The value function at a state s1 is defined as

Vπ(s1)=𝔼[R(s1)+γR(π(s1))+γ2R(π(π(s1)))+π]

The Q function is defined as

Qπ(s,a)=R(s)+γ𝔼sPa(s)[Vπ(s)]

From [5], the inverse reinforcement learning problem for finite states and actions can be formulated as the following linear programming problem, assuming without loss of generality that π*a1. By enforcing the Bellman optimality of the policy π*, linear constraints on the reward function are formed. [5] then suggest some "natural" criteria that then forms the basis of the objective function to be minimized to obtained the desired reward function. The formulation presented in [5] is as follows.

maximize𝑅 i=1Nmina{a2,,ak}(F^aiTR)-λR1 (2.1)
subject to F^aiTR0aAa1,i=1,,n
RRmax

3 Geometric analysis of the IRL problem

The objective of the Inverse Reinforcement Learning problem is to find a reward function that generates an optimal policy. As shown in [5], the necessary and sufficient conditions for a policy π* (without loss of generality π*a1) to be optimal are given by the Bellman Optimality principle and can be stated mathematically as

FaiTR0aAa1,i=1,,n

Clearly, R=0 is always a solution. However this solution is degenerate in the sense that it also allows any and every other policy to be "optimal" and as a result is not of practical use. If the constraint of R0 is considered, then by noticing that the points Fain, the set of reward functions generating the optimal policy π1 is then the set of hyperplanes passing through the origin for which the entire collection of points {Fai} lie in one half space. The problem of Inverse Reinforcement Learning, then is equivalent to the problem of finding such a separating hyperplane passing through the origin for the points {Fai}. Here we also assume none of the Fai=0 as this would mean that there is no distinction between the policies π=a and π1=a1.

This geometric perspective of the IRL problem allows the classification of all finite state, finite action IRL problems into 3 regimes, graphically visualized in Figure 1:

Figure 1: Left: An example graphical visualization of Regime 1 where the origin lies inside the convex hull of {Fai}. Here no hyperplane passing through the origin exists for which all the points {Fai} lie in one half space. Center: An example graphical visualization of Regime 2 where the origin lies on the boundary of the convex hull of {Fai}. Here only one hyperplane passing through the origin exists for which all the points {Fai} lie in one half space. Right: An example graphical visualization of Regime 3 where the origin lies outside the convex hull of {Fai}. Here infinitely many hyperplanes passing through the origin exist for which all the points {Fai} lie in one half space.

Regime 1: In this regime, there is no hyperplane passing through the origin for which all the points {Fai} lie in one half space. This is equivalent to saying that the origin is in the interior of the convex hull of the points {Fai}. In this case, independent of the algorithm, there is no nonzero reward function for which the policy π1 is optimal.

Regime 2: In this regime, up to scaling by a constant, there can be one or more hyperplanes passing through the origin for which all the points {Fai} lie in one half space, however the hyperplanes always contain one of the points {Fai}. This is equivalent to saying that the origin is on the boundary of the convex hull of the points {Fai} but is not one of the vertices since by assumption Fai0. In this case, up to a constant scaling, there are one or more nonzero reward functions that generates the optimal policy π1. In this case, it is also important to notice that the policy π1 cannot be strictly optimal for any of the reward functions.

Regime 3: In this regime, up to scaling by a constant, there are infinitely many hyperplanes passing through the origin for which all the points {Fai} lie in one half space. This is equivalent to saying that the origin is outside the convex hull of the points {Fai}. In this case, up to a constant scaling, there are infinitely many nonzero reward functions that generates the optimal policy π1 and it is possible to find a reward function for which the policy π1 is strictly optimal.

These geometric regimes and their implication on the finite state, finite action inverse reinforcement learning problem are summed up in the following theorem.

Theorem 3.1.

There exists a hyperplane passing through the origin such that all the points {Fai} lie on one side of the hyperplane (or on the hyperplane) if and only if there is a non-zero reward function R0 that generates the optimal policy π=a1 for the inverse reinforcement learning problem {S,A,Pa,γ}. i.e. R such that FaiTR0 a,i.

Remark 3.1.

Notice that as an extension of Theorem 3.1, there is an R for which the policy π=a1 is strictly optimal iff there exists a hyperplane for which all the points {Fai} are strictly on one side.

Remark 3.2.

Note that it is possible to find a separating hyperplane between the origin and the collection of points {Fai} if and only if the problem is in Regime 3. Therefore, the problem of inverse reinforcement learning can be viewed as a one class support vector machine (or as a two class support vector machine with the origin as the negative class) problem in this regime. This, along with the objective of determining sample complexity, leads in to the formulation of the problem discussed in the next section.

4 Formulation of optimization problem

The objective function formulation of the inverse reinforcement problem described in [5] was formed by imposing the conditions that the value from the optimal policy was as far as possible from the next best action at each state, as well as sparseness of the reward function. These were choices made by the authors to enable a unique solution to the proposed linear programming problem. We propose a different formulation in terms of a 1 class L1-regularized support vector machine that allows for a geometric interpretation as well as provides an efficient sample complexity. The Inverse Reinforcement Learning problem is now considered in Regime 3. Here it is known that there is a separating hyperplane between the origin and {Fai} so the strict inequality FaiTR>0 which by scaling of R is equivalent to FaiTR1. Formally this assumption is stated as follows

Definition 4.1 (β-Strict Separability).

An inverse reinforcement learning problem {S,A,Pa,γ} satisfies β-strict separability if and only if there exists a {β,R*} such that

R*1=1𝑎𝑛𝑑FaiTR*β>0aAa1,i=1,,n

Notice that the IRL problem is in Regime 3 (i.e. w such that wTFai>0) if and only if the strict separability assumption is satisfied.

Strict nonzero assumptions are well-accepted in the statistical learning theory community, and have been used for instance in compressed sensing [8], Markov random fields [7], nonparametric regression [4], diffusion networks [3].

Figure 2: An example graphical visualization of Regime 2 the origin lies on the boundary of the convex hull of {Fai}. Perturbation from statistical estimation of the transition probability matrices from empirical data (solid red), makes the problem easily tip into Regime 1 (shown) or Regime 3. An infinite number of samples would be required to solve IRL problems falling into Regime 2.

Problems in Regime 2 are avoided since based on the statistical estimation of the transition probability matrices from empirical data, the problem can easily tip into Regime 1 or Regime 3, as shown in Figure 2. To solve problems in Regime 2, an infinite number of samples would be required, where as problems in Regime 3 can be solved with a large enough number of samples.

Given the strict separability assumption, the optimization problem proposed is as follows

maximize𝑅 R1 (4.1)
subject to F^aiTR1aAa1i=1,,n

This problem is in the form of a one class L1-regularized Support Vector Machine [9] except that we use hard margins instead of soft margins. The minimization of the L1 norm plays a two fold role in this formulation. First, it promotes a sparse reward function, keeping in lines with the idea of simplicity. Second, it also plays a role in establishing the sample complexity bounds of the inverse reinforcement learning problem as well, as shown in the subsequent section. The constraints derive from strict Bellman optimality in the separable case (Regime 3) of inverse reinforcement learning and help avoid the degenerate solution of R=0. We now use this optimization problem along with the objective of finding a reward function for which the policy π=a1 is optimal to establish the correctness and sample complexity of the inverse reinforcement learning problem.

5 Correctness and sample complexity of Inverse Reinforcement Learning

Consider the inverse reinforcement learning problem in the strictly separable case (Regime 3). We have {β,R*} such that

FaiTR*β>0aAa1,i=1,,n

Let Fai-F^aiε. Let R^ be the solution to the optimization problem 4.1 with F^ai. We desire that

FaiTR^0aAa1,i=1,,n

i.e. the reward we obtain from the problem using the estimated transition probability matrices also generates π=a1 as the optimal for the problem with the true transition probabilities. This can be done by reducing ε, i.e. by using more samples. The result in the strictly separable case follows from the following theorem.

Theorem 5.1.

Let {S,A,Pa,γ} be an inverse reinforcement learning problem that is β- strictly separable. Let F^ai be the values of Fai using estimates of the transition probability matrices such that Fai-F^aiε. Let R^ be the solution to the optimization problem 4.1 with F^ai. Let 1c0

ε1-c2-cβ

Then we have FaiTR^caAa1,i=1,,n.

Proof.

Consider FaiTR^0, using Hölder’s inequality we have

FaiTR^-Fai-F^aiR^1+F^aiTR^-εR^1+1 (5.1)

Now let R~=KβR* where K>0 and R* is the reward satisfying the β-strict separability for the problem. We have R~1=KβR*1=Kβ as well as FaiTR~K. Now we have

F^aiTR~-Fai-F^aiR~1+FaiTR~-εR~1+K=-Kεβ+K=K(1-εβ)

We now construct R~ to satisfy the constraints of the optimization problem 4.1 with F^ai by choosing K such that

F^aiTR~K(1-εβ)1K=11-εβ

Notice here since we have K>0, then ε<β

Now since R~ is a feasible solution to the optimization problem 4.1 with F^ai for which R^ is the optimal solution, we have from the objective function

R^1R~1=Kβ

Substituting this upper bound for R^1 in (5.1) we get,

FaiTR^-εKβ+1=1-εβ(11-εβ)1-1-c2-c(11-1-c2-c)=1-1-c2-c(2-c)=c

Remark 5.1.

It is important to note that since K,β>0 and ε0 and c1-εKβ, we have c1 with equality holding only when ε=0, i.e infinitely many samples. This shows the equivalence of the problems with the true and the estimated transitions probabilities in the case of infinite samples.

Our desired result then follows as a corollary of the above theorem.

Corollary 5.1.

Let {S,A,Pa,γ} be an inverse reinforcement learning problem that is β- strictly separable. Let F^ai be the values of Fai using estimates of the transition probability matrices such that Fai-F^aiε. Let R^ be the solution to the optimization problem 4.1 with F^ai.

ε12β

Then we have FaiTR^0aAa1,i=1,,n.

Proof.

Straightforwardly, by setting c=0 in Theorem 5.1. ∎

Theorem 5.2.

Let {S,A,Pa,γ} be an inverse reinforcement learning problem that is β- strictly separable. Let every state be reachable from the starting state in one step with probability at least α. Let R^ be the solution to the optimization problem 4.1 with F^ai with transition probability matrices P^a that are maximum likelihood estimates of Pa formed from m samples where

m64αβ2((n-1)γ+1(1-γ)2)2log4nkδ

Then with probability at least (1-δ), we have FaiTR^0aAa1,i=1,,n.

The theorem above follows from concentration inequalities for the estimation of the transition probabilities, which are detailed in the following section. (All missing proofs are included in the Supplementary Material.)

6 Concentration inequalities

In this section we look at the propagation of the concentration of the empirical estimate of the transition probabilities around their true values.

Lemma 6.1.

Let A and B be two matrices, we have

AB|||A|||B

Next we look at the propagation of the concentration of a right stochastic matrix P to the concentration of its k-th power.

Lemma 6.2.

Let P be a n×n right stochastic matrix and let P^ be an estimate of P such that

P^-Pε

then,

P^k-Pk((k-1)n+1)ε

Now we can consider the concentration of the expression Fai=(Pa1(i)-Pa(i))(I-γPa1)-1.

Notice that since P is a right stochastic matrix and γ<1, we can expand (I-γPa1)-1 as (I-γPa1)-1=j=0(γPa1)j and therefore

(Pa1(i)-Pa(i))(I-γPa1)-1=(Pa1(i)-Pa(i))j=0(γPa1)j
Theorem 6.1.

Let Pa and Pa1 be n×n right stochastic matrices corresponding to actions a and a1 and let γ<1. Let P^a and P^a1 be estimates of Pa and Pa1 such that

P^a-Paε𝑎𝑛𝑑P^a1-Pa1ε

Then, a,a1A

(P^a1-P^a)(I-γP^a1)-1-(Pa1-Pa)(I-γPa1)-12ε(n-1)γ+1(1-γ)2

Note that this result is for each action. The concentration over all actions can be found by using the union bound over the set of actions.

An estimate of the value of ε when the estimation is done using m samples can be shown using the Dvoretzky-Kiefer-Wolfowitz inequality [1] to be on the order of εO(2log2nδm).

This result is shown in the following Theorem 6.2.

Theorem 6.2.

Let Pa be a n×n right stochastic matrix for an action aA and let P^a be an maximum likelihood estimate of Pa formed from m samples. If m2ε2log2nδ, then we have

[P^a-Paε]1-δ

The theorem above assumes that it is possible to start in any given state. However, this may not always be the case. In this case, as long as every state is reachable from an initial state with probability at least α, the result presented in Theorem 5.2 can be modified to use Theorem 6.3 instead of Theorem 6.2.

Theorem 6.3.

Let Pa be a n×n right stochastic matrix for an action aA and let P^a be an maximum likelihood estimate of Pa formed from m samples. Let every state be reachable from the starting state in one step with probability at least α. If m4αε2log4nkδ then

[P^a-Paε]1-δ,δ(0,1)aA

7 Discussion

The result of Theorem 5.2 shows that the number of samples required to solve a β-strict separable inverse reinforcement learning problem and obtain a reward that generates the desired optimal policy is on the order of mO(n2β2log(nk)). Notice that the number of samples in inversely proportional to β2. Thus by viewing the case of Regime 2 as limβ0 of the β-strict separable case (Regime 3), it is easy to see that an infinite number of samples are required to guarantee that the reward obtained will generate the optimal policy for the MDP with the true transition probability matrices.

In practical applications, however, it may be difficult to determine if an inverse reinforcement learning problem is β-strict separable (Regime 3) or not. In this case, the result of equation (5.1) can be used as a witness to determine that the obtained R^ satisfies Bellman’s optimality condition with respect to the true transition probability matrices with high probability as shown in the following remark.

Remark 7.1.

Let {S,A,Pa,γ} be an inverse reinforcement learning problem. Let every state be reachable from the starting state in one step with probability at least α. Let R^ be the solution to the optimization problem 4.1 with F^ai with transition probability matrices P^a that are maximum likelihood estimates of Pa formed from m samples and let

ε=24αmlog4nkδ(n-1)γ+1(1-γ)2

If R^11ε, then with probability at least (1-δ), we have FaiTR^0aAa1,i=1,,n.

8 Experimental results

Figure 3: Empirical probability of success versus number of samples for an inverse reinfocement learning problem performed with n=5 states and k=5 actions (Left) and with n=7 states and k=7 actions (Right) using both our L1-regularized support vector machine formulation and the linear programming formulation proposed in [5]. The vertical blue line represents the sample complexity for our method, as stated in Theorem 5.2

Experiments were performed using randomly generated transition probability matrices for β-strictly separable MDPs with n=5 states, k=5 actions, γ=0.1 and with n=7 states, k=7 actions, γ=0.1. Both experiments were done with Pa1 as the optimal policy. Thirty randomly generated MDPs were considered in each case and a varying number of samples were used to find estimates of the transition probability matrices in each trial. Reward functions R^ were found by solving Problem 4.1 for our L1-regularized SVM formulation, and Problem 2.1 for the method of [5], using the same set of estimated transition probabilities, i.e., F^ai. The resulting reward functions were then tested using the true transition probabilities for FaiTR^0. The percentage of trials for which FaiTR^0 held true for both of the methods is shown in Figure 3 for different number of samples used. As prescribed by Theorem 5.2, for β0.0032, the sufficient number of samples for the success of our method is O(n2β2log(nk)). As we can observe, the success rate increases with the number of samples as expected. The L1-regularized support vector machine, however, significantly outperforms the linear programming formulation proposed in [5], reaching 100% success shortly after the sufficient number of samples while the method proposed by [5] falls far behind. The result is that the reward function given by the L1-regularized support vector machine formulation successfully generates the optimal policy π=a1 in almost 100% of the trials given O(n2β2log(nk)) samples while the reward function estimated by the method presented in [5] fails to generate the desired optimal policy.

9 Concluding remarks

The L1-regularized support vector formulation along with the geometric interpretation provide a useful way of looking at the inverse reinforcement learning problem with strong, formal guarantees. Possible future work on this problem includes extension to the inverse reinforcement learning problem with continuous states by using sets of basis functions as presented in [5].

References

  • [1] J. Kiefer A. Dvoretzky and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.
  • [2] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, pages 1–, New York, NY, USA, 2004. ACM.
  • [3] Hadi Daneshmand, Manuel Gomez-Rodriguez, Le Song, and Bernhard Scholkopf. Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In International Conference on Machine Learning, pages 793–801, 2014.
  • [4] Han Liu, Larry Wasserman, John D Lafferty, and Pradeep K Ravikumar. Spam: Sparse additive models. In Advances in Neural Information Processing Systems, pages 1201–1208, 2008.
  • [5] A. Y. Ng and S.J. Russel. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663 – 670, 2000.
  • [6] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. Urbana, 51(61801):1–4, 2007.
  • [7] Pradeep Ravikumar, Martin J Wainwright, John D Lafferty, et al. High-dimensional ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
  • [8] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
  • [9] Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J Hastie. 1-norm support vector machines. In Advances in neural information processing systems, pages 49–56, 2004.

Appendix: On the Sample Complexity of Inverse Reinforcement Learning

This appendix contains the proofs for various Lemmas and Theorems presented in the paper.

Appendix A Proofs of Lemmas and Theorems

A.1 Proof of Theorem 3.1

Proof.

The proof follows from the fact that the points {Fai} lie on one side of the hyperplane passing through the origin given by wTx=0 if and only if

wTFai0aAa1,i=1,,n

or

wTFai0aAa1,i=1,,n

The proof in the ’if’ direction follows by taking the hyperplane defined by w=R and noticing that wTFai=RTFai=FaiTR0 so all the points {Fai} lie on one side of the hyperplane passing through the origin given by RTx=0

The proof in the ’only if’ direction is as follows. Consider a separating hyperplane w. Without loss of generality,

wTFai0

Now let R=w then FaiTR=RTFai=wTFai0 so R=w generates the optimal policy π=a1. ∎

A.2 Proof of Theorem 5.2

Proof.

The proof of this theorem is a consequence of Corollary 5.1 and Theorems 6.1 and 6.3. Note that from Theorem 6.3, we want the concentration to hold with probability (1-δ) for all transition probability matrices corresponding to the set of actions. This can be viewed as the concentration inequality holding for a single nk×n matrix which gives us the result for m samples

m4αε2log4nkδ
[P^a-Paε1]<1-δ

The result then follows from substituting this value of ε1 into the ε in Theorem 6.1 and the consequent result into Corollary 5.1. ∎

A.3 Proof of Lemma 6.1

Proof.

Let C=AB, we have cij=kaikbkj

AB=C=supi,j|cij|

From Holder’s inequality we get

AB =supi,j{|kaikbkj|}
supi,j{|kaik|}supi,j{|bik|}
supi,j{k|aik|}supi,j{|bik|}
=|||A|||B

A.4 Proof of Lemma 6.2

Proof.

First note that if P is a right stochastic matrix then Pk is a right stochastic matrix for all natural numbers k. Consider n×n right stochastic matrices A,B,C,D. Consider the expression AC-BD From Lemma 1, we get,

AC-BD =AC-AD+AD-BD
AC-AD+AD-BD
|||A|||C-D+|||A-B|||D

Notice that |||A-B|||nA-B and |||A|||=1 and D1, thus we have

AC-BD C-D+nA-B

Now we will prove the lemma by induction k=1. We have

P^-Pε=((1-1)n+1)ε

Assume the statement for k-1 is true. For k>1 we have

P^(k-1)-P(k-1)(((k-1))-1)n+1)ε

Consider the previous result with A=P^,B=P,C=P^(k-1),D=P(k-1). Substituting, we get

P^P^(k-1)-PP(k-1)(((k-1))-1)n+1)ε+nε
P^(k)-P(k)((k-1)n+1)ε

A.5 Proof of Theorem 6.1

Proof.

Consider the expression from the theorem

(P^a1-P^a)(I-γP^a1)-1-(Pa1-Pa)(I-γPa1)-1
=(P^a1-P^a)j=0(γP^a1)j-(Pa1-Pa)j=0(γPa1)j
=(P^a1-P^a)j=0(γP^a1)j-P^aj=0(γPa1)j+P^aj=0(γPa1)j-(Pa1-Pa)j=0(γPa1)j
=j=0γj(P^a1j+1-Pa1j+1)-(P^a)j=0γj(P^a1j-Pa1j)-(P^a-Pa)j=0γj(Pa1j)
j=0γj(P^a1j+1-Pa1j+1)+j=0γj(P^a)(P^a1j-Pa1j)+j=0γj(P^a-Pa)(Pa1j)

From Lemma 6.1 and Lemma 6.2; and the fact that for a right stochastic matrix P, |||P|||=1 and P1; we have

j=0γj(P^a1j+1-Pa1j+1)+j=0γj(P^a)(P^a1j-Pa1j)+j=0γj(P^a-Pa)(Pa1j)
j=0γj(P^a1j+1-Pa1j+1)+j=0γj|||P^a|||(P^a1j-Pa1j)+j=0γj|||P^a-Pa|||(Pa1j)
j=0γj(P^a1j+1-Pa1j+1)+j=0γj(P^a1j-Pa1j)+j=0γjnP^a-Pa
j=0γj((j)n+1)ε+j=0γj((j-1)n+1)ε+j=0γjnε
=εj=0γj((jn+1)+((j-1)n+1)+n)
=2nεj=0jγj+2εj=0γj
=2ε(nγ(1-γ)2+11-γ)
=2ε(n-1)γ+1(1-γ)2

A.6 Proof of Theorem 6.2

Proof.

Here we invoke The Dvoretzky-Kiefer-Wolfowitz inequality [1]. Consider m samples of a random variable Yia with domain {1,,n}, let yia(1),,yia(m){1,,n} correspond to the observed resulting state under an action a taken at a state i. Let T^ia(s)=1mj=1m𝟙[yia(j)s] be an estimate of the CDF of Yia and let Tia(s)=P[Yias] be the actual CDF. From the Dvoretzky-Kiefer-Wolfowitz inequality we have

(sups{1,,n}|T^ia(s)-Tia(s)|>ε)2e-2mε2
(sups{1,,n}|T^ia(s)-Tia(s)|ε)>1-2e-2mε2

Now consider the PDF of Yia given by 𝐩^ia(s)=T^ia(s)-T^ia(s-1). Notice that

|𝐩^ia(s)-𝐩ia(s)| |(T^ia(s)-T^ia(s-1))-(Tia(s)-Tia(s-1))|
|T^ia(s)-Tia(s)|+|T^ia(s-1)-Tia(s-1)|

So if we have

sups{1,,n}|T^ia(s)-Tia(s)|ε

then

sups{1,,n}|𝐩^ia(s)-𝐩ia(s)|2ε
(sups{1,,n}|𝐩^ia(s)-𝐩ia(s)|ε)>1-2e-mε2/2

Here we can interpret 𝐩^ia() and 𝐩ia() as the i-th rows of the matrices P^a and Pa respectively. 𝐩^(Yia), is the maximum likelihood estimator formed from m samples. From application of the union bound over all rows of the matrix Pa, we have for ε>0, and m samples,

((i1,,n)𝐩^(Yia)-𝐩(Yia)<ε) >1-2ne-mε2/2
[P^a-Paε]1-δ,δ(0,1)

if m2ε2log2nδ

A.7 Proof of Theorem 6.3

Proof.

Without loss of generality, let every state j=1,,n be reachable from state j=1 by action a1 after a step with probability at least α. Let Yja be a random variable domain {1,,n}. Let Zj be a Bernoulli random variable such that P(Zj=1)αj. Let (zj(1),yj(1)),,(zj(m),yj(m)) be m pairs of independent samples of Zj and Yaj. Here Zj represents the state chain 1a1j

Consider the event A1{1mk=1mzj(k)α-ϵj}. By the one-sided Hoeffding’s inequality and taking the union bound over all states we have

(A1)1-ne-2ϵ2m

We also have the conditional maximum likelihood probability estimator

𝐩^(Yj=s|Zj=1)=1k=1mzj(k)l=1m𝟙[(yj(l)=s)zj(l)]

From Theorem 6.2 we have for event

A2{𝐩^(Yja|Zj=1)-𝐩(Yja|Zj=1)β}
(A2|A1)1-2ne-2β2m(α-ϵ)/2

By the law of total probability

(A2)(A2,A1) =(A2|A1)(A1)
=(1-ne-2ϵ2m)(1-2ne-2β2m(α-ϵ)/2)
1-ne-2ϵ2m-2ne-2β2m(α-ϵ)/2

By solving δ2=ne-2ϵ2m and δ2=2ne-2β2m(α-ϵ)/2 we can see that if mmax{12ϵ2log2nδ,2(α-ϵ)β2log4nδ} then (A2)1-δ Letting ϵ=α2 and taking the union bound over all actions aA we have if m4αβ2log4nkδ then

[P^a-Paβ]1-δ,δ(0,1)aA

Appendix B Experiment setup and additional results

Experiments were performed in MATLAB using randomly generated transition probability matrices for β-strictly separable MDPs with n states, k actions, γ.

We generate the rows of the transition probability matrices individually in one of two different ways. In the first method each row Pa(i)=x[0,1]n is generated as a uniformly sampled point from the region 𝒞:={xnx0,x1=1}. In the second method, we wanted to simulate situations where taking an action at a state would lead to a few states (say n1 states) with a large probability and the remaining n-n1 states with a small probability. This is similar to a situation where the action chosen is followed with a large probability and where state jumps to a random state with a small probability. This is based on the idea of following the action with a large probability and randomly "exploring" with a small probability. This is similar to transition rules used by [5] in their experiment. Both of these methods were tested in generating the "true" transition probability matrices. The results shown in Figure 4 of this supplement were obtained using transition probability matrices generated by the first method. The results presented in Figure 3 in the main paper and in Figure 5 of this supplement were obtained for transition probability matrices generated by the second method. We also checked all generated transition probability matrices to ensure β-separability. The maximum likelihood estimates P^ai of these transition probability matrices were formed by sampling trajectories under the true transition probability matrices with the action chosen uniformly at random at each state. Several trajectories were formed, each with a random initial state to ensure that each state was reachable in the simulations.

Recall that Fai=(Pa1(i)-Pa(i))(I-γPa1)-1 and F^ai=(P^a1(i)-P^a(i))(I-γP^a1)-1. Reward functions R^ were found by solving our L1-regularized SVM formulation, and the method of Ng & Russel (2000), using the same set of estimated transition probabilities, i.e., F^ai. The resulting reward functions were then tested using the true transition probabilities for FaiTR^0.

Additional results for 30 repetitions of n=5 states, k=5 actions, separability β=0.0032, with transition probabilities generated using the first method are shown in Figure 4. Results for 20 repetitions of n=10 states, k=10 actions, separability β=0.0032, with transition probabilities generated using the second method are shown in Figure 5.

Figure 4: Empirical probability of success versus number of samples for an inverse reinforcement learning problem performed with n=5 states and k=5 actions using both our L1-regularized support vector machine formulation and the linear programming formulation proposed in [5]. The samples were generated using the first method as described in this supplement. The vertical blue line represents the sample complexity for our method
Figure 5: Empirical probability of success versus number of samples for an inverse reinforcement learning problem performed with n=10 states and k=10 actions using both our L1-regularized support vector machine formulation and the linear programming formulation proposed in [5]. The samples were generated using the second method described in this supplement (i.e., the same method used in the main paper). The vertical blue line represents the sample complexity for our method