Abstract
Inverse reinforcement learning (IRL) is the problem of finding a rewardfunction that generates a given optimal policy for a given Markov DecisionProcess. This paper looks at an algorithmicindependent geometric analysis ofthe IRL problem with finite states and actions. A L1regularized Support VectorMachine formulation of the IRL problem motivated by the geometric analysis isthen proposed with the basic objective of the inverse reinforcement problem inmind: to find a reward function that generates a specified optimal policy. Thepaper further analyzes the proposed formulation of inverse reinforcementlearning with $n$ states and $k$ actions, and shows a sample complexity of$O(n^2 \log (nk))$ for recovering a reward function that generates a policythat satisfies Bellman's optimality condition with respect to the truetransition probabilities.
Quick Read (beta)
On the Correctness and Sample Complexity of Inverse Reinforcement Learning
Abstract
Inverse reinforcement learning (IRL) is the problem of finding a reward function that generates a given optimal policy for a given Markov Decision Process. This paper looks at an algorithmicindependent geometric analysis of the IRL problem with finite states and actions. A L1regularized Support Vector Machine formulation of the IRL problem motivated by the geometric analysis is then proposed with the basic objective of the inverse reinforcement problem in mind: to find a reward function that generates a specified optimal policy. The paper further analyzes the proposed formulation of inverse reinforcement learning with $n$ states and $k$ actions, and shows a sample complexity of $O({n}^{2}\mathrm{log}(nk))$ for recovering a reward function that generates a policy that satisfies Bellman’s optimality condition with respect to the true transition probabilities.
1 Introduction
Reinforcement Learning is the process of generating an optimal policy for a given Markov Decision Process along with a reward function. Often, in situations including apprenticeship learning, the reward function is unknown but optimal policy can be observed through the actions of an expert. In cases such as these, it is desirable to learn a reward function generating the observed optimal policy. This problem is referred to as Inverse Reinforcement Learning (IRL) [5]. It is well known that such a reward function is not necessarily unique. Various algorithms to solve the IRL problem have been made including linear programming [5] and Bayesian estimation [6]. [2] looked at using IRL to solve the apprenticeship learning problem by trying to find a reward function that maximizes the margin of the expert’s policy. The goal of the problem presented in [2] is to find a policy that comes close in value to the expert’s policy for some unspecified true reward function. However none of the prior works provide a formal guarantee that the reward function obtained from the empirical data is optimal for the true transition probabilities in inverse reinforcement learning.
This paper looks at formulating the IRL problem by using the basic objective of inverse reinforcement: to find a reward function that generates a specified optimal policy. The paper also looks at establishing a sample complexity to meet this basic goal when the transition probabilities are estimated from observed trajectories. To achieve this, an algorithmicindependent geometric analysis of the IRL problem with finite states and actions as presented in [5] is provided. A L1regularized Support Vector Machine (SVM) formulation of the IRL problem motivated by the geometric analysis is then proposed. Theoretical analysis of the sample complexity of the L1 SVM formulation is then performed. Finally, experimental results comparing the L1 SVM formulation to the linear programming formulation presented in [5] are presented showing the improved performance of the L1 SVM formulation with respect to the basic objective, i.e Bellman optimality with respect to the true transition probabilities. To the best of our knowledge, we are the first to provide an algorithm with formal guarantees for inverse reinforcement learning.
2 Preliminaries
The formulation of the IRL problem is based on a Markov Decision Process (MDP) $(S,A,\{{P}_{sa}\},\gamma ,R)$, where

•
$S$ is a finite set of $n$ states.

•
$A=\{{a}_{1},\mathrm{\dots},{a}_{k}\}$ is a set of $k$ actions.

•
${P}_{a}\in {[0,1]}^{n\times n}$ are the state transition probabilities for action $a$. We use ${P}_{a}(s)\in {[0,1]}^{n}$ and ${P}_{sa}\in {[0,1]}^{n}$ to represent the state transition probabilities for action $a$ in state $s$ and ${P}_{a}(i,j)\in [0,1]$ to represent the probability of going from state $i$ to state $j$ when taking action $a$.

•
$\gamma \in [0,1]$ is the discount factor.

•
$R:S\to \mathbb{R}$ is the reinforcement or reward function.
It is important to note that the state transition probability matrices are right stochastic. Mathematically this can be stated as ${P}_{a}(i,j)\ge 0\forall i,j\text{and}{\sum}_{j}{P}_{a}(i,j)=1\forall i$
In the subsequent sections, we use the notation
$${F}_{ai}:=({P}_{{a}_{1}}(i){P}_{a}(i)){(I\gamma {P}_{{a}_{1}})}^{1}$$ 
The empirical maximum likelihood estimates of the transition probabilities from sampled trajectories are denoted by ${\widehat{P}}_{a}(i)$ and in a similar fashion we use the notation
$${\widehat{F}}_{ai}:=({\widehat{P}}_{{a}_{1}}(i){\widehat{P}}_{a}(i)){(I\gamma {\widehat{P}}_{{a}_{1}})}^{1}$$ 
The following norms are used throughout this paper. The infinity norm of a matrix $A=[{a}_{ij}]$ is defined as $\parallel A{\parallel}_{\mathrm{\infty}}={sup}_{i,j}{a}_{ij}$. The ${L}^{1}$ norm of a vector is defined as ${\parallel b\parallel}_{1}={\sum}_{i}{b}_{i}$. The induced matrix norm is defined as ${\leftA\right}_{\mathrm{\infty}}={sup}_{j}{\parallel {a}_{j}\parallel}_{1}$ where ${a}_{j}$ is the $j$th row of the matrix $A$. Note that for a right stochastic matrix $P$, we can see that ${\leftP\right}_{\mathrm{\infty}}=1$ and $\parallel P{\parallel}_{\mathrm{\infty}}\le 1$.
In this paper the reward function is assumed to be a function of purely the state instead of the state and the action. This assumption is also made for the initial results in [5]. A policy is defined as a map $\pi :S\to A$. Given a policy $\pi $, we can define two functions.
The value function at a state ${s}_{1}$ is defined as
$${V}^{\pi}({s}_{1})=\mathbb{E}\left[R({s}_{1})+\gamma R(\pi ({s}_{1}))+{\gamma}^{2}R(\pi (\pi ({s}_{1})))+\mathrm{\dots}\mid \pi \right]$$ 
The $Q$ function is defined as
$${Q}^{\pi}(s,a)=R(s)+\gamma {\mathbb{E}}_{{s}^{\prime}\sim {P}_{a}(s)}[{V}^{\pi}({s}^{\prime})]$$ 
From [5], the inverse reinforcement learning problem for finite states and actions can be formulated as the following linear programming problem, assuming without loss of generality that ${\pi}^{*}\equiv {a}_{1}$. By enforcing the Bellman optimality of the policy ${\pi}^{*}$, linear constraints on the reward function are formed. [5] then suggest some "natural" criteria that then forms the basis of the objective function to be minimized to obtained the desired reward function. The formulation presented in [5] is as follows.
$\underset{\mathit{R}}{\text{maximize}}$  $\sum _{i=1}^{N}}\underset{a\in \{{a}_{2},\mathrm{\dots},{a}_{k}\}}{\mathrm{min}}\left({\widehat{F}}_{ai}^{T}R\right)\lambda {\parallel R\parallel}_{1$  (2.1)  
subject to  ${\widehat{F}}_{ai}^{T}R\ge 0\forall a\in A\setminus {a}_{1},i=1,\mathrm{\dots},n$  
${\parallel R\parallel}_{\mathrm{\infty}}\le {R}_{max}$ 
3 Geometric analysis of the IRL problem
The objective of the Inverse Reinforcement Learning problem is to find a reward function that generates an optimal policy. As shown in [5], the necessary and sufficient conditions for a policy ${\pi}^{*}$ (without loss of generality ${\pi}^{*}\equiv {a}_{1}$) to be optimal are given by the Bellman Optimality principle and can be stated mathematically as
$${F}_{ai}^{T}R\ge 0\forall a\in A\setminus {a}_{1},i=1,\mathrm{\dots},n$$ 
Clearly, $R=0$ is always a solution. However this solution is degenerate in the sense that it also allows any and every other policy to be "optimal" and as a result is not of practical use. If the constraint of $R\ne 0$ is considered, then by noticing that the points ${F}_{ai}\in {\mathbb{R}}^{n}$, the set of reward functions generating the optimal policy ${\pi}_{1}$ is then the set of hyperplanes passing through the origin for which the entire collection of points $\left\{{F}_{ai}\right\}$ lie in one half space. The problem of Inverse Reinforcement Learning, then is equivalent to the problem of finding such a separating hyperplane passing through the origin for the points $\left\{{F}_{ai}\right\}$. Here we also assume none of the ${F}_{ai}=0$ as this would mean that there is no distinction between the policies $\pi =a$ and ${\pi}_{1}={a}_{1}$.
This geometric perspective of the IRL problem allows the classification of all finite state, finite action IRL problems into 3 regimes, graphically visualized in Figure 1:
Regime 1: In this regime, there is no hyperplane passing through the origin for which all the points $\left\{{F}_{ai}\right\}$ lie in one half space. This is equivalent to saying that the origin is in the interior of the convex hull of the points $\left\{{F}_{ai}\right\}$. In this case, independent of the algorithm, there is no nonzero reward function for which the policy ${\pi}_{1}$ is optimal.
Regime 2: In this regime, up to scaling by a constant, there can be one or more hyperplanes passing through the origin for which all the points $\left\{{F}_{ai}\right\}$ lie in one half space, however the hyperplanes always contain one of the points $\left\{{F}_{ai}\right\}$. This is equivalent to saying that the origin is on the boundary of the convex hull of the points $\left\{{F}_{ai}\right\}$ but is not one of the vertices since by assumption ${F}_{ai}\ne 0$. In this case, up to a constant scaling, there are one or more nonzero reward functions that generates the optimal policy ${\pi}_{1}$. In this case, it is also important to notice that the policy ${\pi}_{1}$ cannot be strictly optimal for any of the reward functions.
Regime 3: In this regime, up to scaling by a constant, there are infinitely many hyperplanes passing through the origin for which all the points $\left\{{F}_{ai}\right\}$ lie in one half space. This is equivalent to saying that the origin is outside the convex hull of the points $\left\{{F}_{ai}\right\}$. In this case, up to a constant scaling, there are infinitely many nonzero reward functions that generates the optimal policy ${\pi}_{1}$ and it is possible to find a reward function for which the policy ${\pi}_{1}$ is strictly optimal.
These geometric regimes and their implication on the finite state, finite action inverse reinforcement learning problem are summed up in the following theorem.
Theorem 3.1.
There exists a hyperplane passing through the origin such that all the points $\mathrm{\{}{F}_{a\mathit{}i}\mathrm{\}}$ lie on one side of the hyperplane (or on the hyperplane) if and only if there is a nonzero reward function $R\mathrm{\ne}\mathrm{0}$ that generates the optimal policy $\pi \mathrm{=}{a}_{\mathrm{1}}$ for the inverse reinforcement learning problem $\mathrm{\{}S\mathrm{,}A\mathrm{,}{P}_{a}\mathrm{,}\gamma \mathrm{\}}$. i.e. $\mathrm{\exists}R$ such that ${F}_{a\mathit{}i}^{T}\mathit{}R\mathrm{\ge}\mathrm{0}$ $\mathrm{\forall}a\mathrm{,}i$.
Remark 3.1.
Notice that as an extension of Theorem 3.1, there is an $R$ for which the policy $\pi \mathrm{=}{a}_{\mathrm{1}}$ is strictly optimal iff there exists a hyperplane for which all the points $\mathrm{\{}{F}_{a\mathit{}i}\mathrm{\}}$ are strictly on one side.
Remark 3.2.
Note that it is possible to find a separating hyperplane between the origin and the collection of points $\mathrm{\left\{}{F}_{a\mathit{}i}\mathrm{\right\}}$ if and only if the problem is in Regime 3. Therefore, the problem of inverse reinforcement learning can be viewed as a one class support vector machine (or as a two class support vector machine with the origin as the negative class) problem in this regime. This, along with the objective of determining sample complexity, leads in to the formulation of the problem discussed in the next section.
4 Formulation of optimization problem
The objective function formulation of the inverse reinforcement problem described in [5] was formed by imposing the conditions that the value from the optimal policy was as far as possible from the next best action at each state, as well as sparseness of the reward function. These were choices made by the authors to enable a unique solution to the proposed linear programming problem. We propose a different formulation in terms of a 1 class L1regularized support vector machine that allows for a geometric interpretation as well as provides an efficient sample complexity. The Inverse Reinforcement Learning problem is now considered in Regime 3. Here it is known that there is a separating hyperplane between the origin and $\left\{{F}_{ai}\right\}$ so the strict inequality ${F}_{ai}^{T}R>0$ which by scaling of $R$ is equivalent to ${F}_{ai}^{T}R\ge 1$. Formally this assumption is stated as follows
Definition 4.1 ($\beta $Strict Separability).
An inverse reinforcement learning problem $\mathrm{\{}S\mathrm{,}A\mathrm{,}{P}_{a}\mathrm{,}\gamma \mathrm{\}}$ satisfies $\beta $strict separability if and only if there exists a $\mathrm{\{}\beta \mathrm{,}{R}^{\mathrm{*}}\mathrm{\}}$ such that
$${\parallel {R}^{*}\parallel}_{1}=1\text{\mathit{a}\mathit{n}\mathit{d}}{F}_{ai}^{T}{R}^{*}\ge \beta >0\forall a\in A\setminus {a}_{1},i=1,\mathrm{\dots},n$$ 
Notice that the IRL problem is in Regime 3 (i.e. $\exists w$ such that ${w}^{T}{F}_{ai}>0$) if and only if the strict separability assumption is satisfied.
Strict nonzero assumptions are wellaccepted in the statistical learning theory community, and have been used for instance in compressed sensing [8], Markov random fields [7], nonparametric regression [4], diffusion networks [3].
Problems in Regime 2 are avoided since based on the statistical estimation of the transition probability matrices from empirical data, the problem can easily tip into Regime 1 or Regime 3, as shown in Figure 2. To solve problems in Regime 2, an infinite number of samples would be required, where as problems in Regime 3 can be solved with a large enough number of samples.
Given the strict separability assumption, the optimization problem proposed is as follows
$\underset{\mathit{R}}{\text{maximize}}$  ${\parallel R\parallel}_{1}$  (4.1)  
subject to  ${\widehat{F}}_{ai}^{T}R\ge 1\forall a\in A\setminus {a}_{1}i=1,\mathrm{\dots},n$ 
This problem is in the form of a one class L1regularized Support Vector Machine [9] except that we use hard margins instead of soft margins. The minimization of the ${L}^{1}$ norm plays a two fold role in this formulation. First, it promotes a sparse reward function, keeping in lines with the idea of simplicity. Second, it also plays a role in establishing the sample complexity bounds of the inverse reinforcement learning problem as well, as shown in the subsequent section. The constraints derive from strict Bellman optimality in the separable case (Regime 3) of inverse reinforcement learning and help avoid the degenerate solution of $R=0$. We now use this optimization problem along with the objective of finding a reward function for which the policy $\pi ={a}_{1}$ is optimal to establish the correctness and sample complexity of the inverse reinforcement learning problem.
5 Correctness and sample complexity of Inverse Reinforcement Learning
Consider the inverse reinforcement learning problem in the strictly separable case (Regime 3). We have $\exists \{\beta ,{R}^{*}\}$ such that
$${F}_{ai}^{T}{R}^{*}\ge \beta >0\forall a\in A\setminus {a}_{1},i=1,\mathrm{\dots},n$$ 
Let ${\parallel {F}_{ai}{\widehat{F}}_{ai}\parallel}_{\mathrm{\infty}}\le \epsilon $. Let $\widehat{R}$ be the solution to the optimization problem 4.1 with ${\widehat{F}}_{ai}$. We desire that
$${F}_{ai}^{T}\widehat{R}\ge 0\forall a\in A\setminus {a}_{1},i=1,\mathrm{\dots},n$$ 
i.e. the reward we obtain from the problem using the estimated transition probability matrices also generates $\pi ={a}_{1}$ as the optimal for the problem with the true transition probabilities. This can be done by reducing $\epsilon $, i.e. by using more samples. The result in the strictly separable case follows from the following theorem.
Theorem 5.1.
Let $\mathrm{\{}S\mathrm{,}A\mathrm{,}{P}_{a}\mathrm{,}\gamma \mathrm{\}}$ be an inverse reinforcement learning problem that is $\beta $ strictly separable. Let ${\widehat{F}}_{a\mathit{}i}$ be the values of ${F}_{a\mathit{}i}$ using estimates of the transition probability matrices such that ${\mathrm{\parallel}{F}_{a\mathit{}i}\mathrm{}{\widehat{F}}_{a\mathit{}i}\mathrm{\parallel}}_{\mathrm{\infty}}\mathrm{\le}\epsilon $. Let $\widehat{R}$ be the solution to the optimization problem 4.1 with ${\widehat{F}}_{a\mathit{}i}$. Let $\mathrm{1}\mathrm{\ge}c\mathrm{\ge}\mathrm{0}$
$$\epsilon \le \frac{1c}{2c}\beta $$ 
Then we have ${F}_{a\mathit{}i}^{T}\mathit{}\widehat{R}\mathrm{\ge}c\mathit{}\mathrm{\forall}a\mathrm{\in}A\mathrm{\setminus}{a}_{\mathrm{1}}\mathrm{,}i\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}n$.
Proof.
Consider ${F}_{ai}^{T}\widehat{R}\ge 0$, using Hölder’s inequality we have
$${F}_{ai}^{T}\widehat{R}\ge {\parallel {F}_{ai}{\widehat{F}}_{ai}\parallel}_{\mathrm{\infty}}{\parallel \widehat{R}\parallel}_{1}+{\widehat{F}}_{ai}^{T}\widehat{R}\ge \epsilon {\parallel \widehat{R}\parallel}_{1}+1$$  (5.1) 
Now let $\stackrel{~}{R}=\frac{K}{\beta}{R}^{*}$ where $K>0$ and ${R}^{*}$ is the reward satisfying the $\beta $strict separability for the problem. We have ${\parallel \stackrel{~}{R}\parallel}_{1}=\frac{K}{\beta}{\parallel {R}^{*}\parallel}_{1}=\frac{K}{\beta}$ as well as ${F}_{ai}^{T}\stackrel{~}{R}\ge K$. Now we have
$${\widehat{F}}_{ai}^{T}\stackrel{~}{R}\ge {\parallel {F}_{ai}{\widehat{F}}_{ai}\parallel}_{\mathrm{\infty}}{\parallel \stackrel{~}{R}\parallel}_{1}+{F}_{ai}^{T}\stackrel{~}{R}\ge \epsilon {\parallel \stackrel{~}{R}\parallel}_{1}+K=\frac{K\epsilon}{\beta}+K=K\left(1\frac{\epsilon}{\beta}\right)$$ 
We now construct $\stackrel{~}{R}$ to satisfy the constraints of the optimization problem 4.1 with ${\widehat{F}}_{ai}$ by choosing $K$ such that
$${\widehat{F}}_{ai}^{T}\stackrel{~}{R}\ge K\left(1\frac{\epsilon}{\beta}\right)\ge 1\u27f9K=\frac{1}{1\frac{\epsilon}{\beta}}$$ 
Notice here since we have $K>0$, then $$
Now since $\stackrel{~}{R}$ is a feasible solution to the optimization problem 4.1 with ${\widehat{F}}_{ai}$ for which $\widehat{R}$ is the optimal solution, we have from the objective function
$${\parallel \widehat{R}\parallel}_{1}\le {\parallel \stackrel{~}{R}\parallel}_{1}=\frac{K}{\beta}$$ 
Substituting this upper bound for ${\parallel \widehat{R}\parallel}_{1}$ in (5.1) we get,
$${F}_{ai}^{T}\widehat{R}\ge \epsilon \frac{K}{\beta}+1=1\frac{\epsilon}{\beta}\left(\frac{1}{1\frac{\epsilon}{\beta}}\right)\ge 1\frac{1c}{2c}\left(\frac{1}{1\frac{1c}{2c}}\right)=1\frac{1c}{2c}(2c)=c$$ 
∎
Remark 5.1.
It is important to note that since $K\mathrm{,}\beta \mathrm{>}\mathrm{0}$ and $\epsilon \mathrm{\ge}\mathrm{0}$ and $c\mathrm{\le}\mathrm{1}\mathrm{}\epsilon \mathit{}\frac{K}{\beta}$, we have $c\mathrm{\le}\mathrm{1}$ with equality holding only when $\epsilon \mathrm{=}\mathrm{0}$, i.e infinitely many samples. This shows the equivalence of the problems with the true and the estimated transitions probabilities in the case of infinite samples.
Our desired result then follows as a corollary of the above theorem.
Corollary 5.1.
Let $\mathrm{\{}S\mathrm{,}A\mathrm{,}{P}_{a}\mathrm{,}\gamma \mathrm{\}}$ be an inverse reinforcement learning problem that is $\beta $ strictly separable. Let ${\widehat{F}}_{a\mathit{}i}$ be the values of ${F}_{a\mathit{}i}$ using estimates of the transition probability matrices such that ${\mathrm{\parallel}{F}_{a\mathit{}i}\mathrm{}{\widehat{F}}_{a\mathit{}i}\mathrm{\parallel}}_{\mathrm{\infty}}\mathrm{\le}\epsilon $. Let $\widehat{R}$ be the solution to the optimization problem 4.1 with ${\widehat{F}}_{a\mathit{}i}$.
$$\epsilon \le \frac{1}{2}\beta $$ 
Then we have ${F}_{a\mathit{}i}^{T}\mathit{}\widehat{R}\mathrm{\ge}\mathrm{0}\mathit{}\mathrm{\forall}a\mathrm{\in}A\mathrm{\setminus}{a}_{\mathrm{1}}\mathrm{,}i\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}n$.
Proof.
Straightforwardly, by setting $c=0$ in Theorem 5.1. ∎
Theorem 5.2.
Let $\mathrm{\{}S\mathrm{,}A\mathrm{,}{P}_{a}\mathrm{,}\gamma \mathrm{\}}$ be an inverse reinforcement learning problem that is $\beta $ strictly separable. Let every state be reachable from the starting state in one step with probability at least $\alpha $. Let $\widehat{R}$ be the solution to the optimization problem 4.1 with ${\widehat{F}}_{a\mathit{}i}$ with transition probability matrices ${\widehat{P}}_{a}$ that are maximum likelihood estimates of ${P}_{a}$ formed from $m$ samples where
$$m\ge \frac{64}{\alpha {\beta}^{2}}{\left(\frac{(n1)\gamma +1}{{(1\gamma )}^{2}}\right)}^{2}\mathrm{log}\frac{4nk}{\delta}$$ 
Then with probability at least $\mathrm{(}\mathrm{1}\mathrm{}\delta \mathrm{)}$, we have ${F}_{a\mathit{}i}^{T}\mathit{}\widehat{R}\mathrm{\ge}\mathrm{0}\mathit{}\mathrm{\forall}a\mathrm{\in}A\mathrm{\setminus}{a}_{\mathrm{1}}\mathrm{,}i\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}n$.
The theorem above follows from concentration inequalities for the estimation of the transition probabilities, which are detailed in the following section. (All missing proofs are included in the Supplementary Material.)
6 Concentration inequalities
In this section we look at the propagation of the concentration of the empirical estimate of the transition probabilities around their true values.
Lemma 6.1.
Let A and B be two matrices, we have
$$\parallel AB\parallel {}_{\mathrm{\infty}}\le \left\right\leftA\right\left\right{}_{\mathrm{\infty}}\parallel B\parallel {}_{\mathrm{\infty}}$$ 
Next we look at the propagation of the concentration of a right stochastic matrix $P$ to the concentration of its $k$th power.
Lemma 6.2.
Let $P$ be a $n\mathrm{\times}n$ right stochastic matrix and let $\widehat{P}$ be an estimate of $P$ such that
$${\parallel \widehat{P}P\parallel}_{\mathrm{\infty}}\le \epsilon $$ 
then,
$${\parallel {\widehat{P}}^{k}{P}^{k}\parallel}_{\mathrm{\infty}}\le ((k1)n+1)\epsilon $$ 
Now we can consider the concentration of the expression ${F}_{ai}=({P}_{{a}_{1}}(i){P}_{a}(i)){(I\gamma {P}_{{a}_{1}})}^{1}$.
Notice that since $P$ is a right stochastic matrix and $$, we can expand ${(I\gamma {P}_{{a}_{1}})}^{1}$ as ${(I\gamma {P}_{{a}_{1}})}^{1}={\sum}_{j=0}^{\mathrm{\infty}}{\left(\gamma {P}_{{a}_{1}}\right)}^{j}$ and therefore
$$({P}_{{a}_{1}}(i){P}_{a}(i)){(I\gamma {P}_{{a}_{1}})}^{1}=({P}_{{a}_{1}}(i){P}_{a}(i))\sum _{j=0}^{\mathrm{\infty}}{\left(\gamma {P}_{{a}_{1}}\right)}^{j}$$ 
Theorem 6.1.
Let ${P}_{a}$ and ${P}_{{a}_{\mathrm{1}}}$ be $n\mathrm{\times}n$ right stochastic matrices corresponding to actions $a$ and ${a}_{\mathrm{1}}$ and let $$. Let ${\widehat{P}}_{a}$ and ${\widehat{P}}_{{a}_{\mathrm{1}}}$ be estimates of ${P}_{a}$ and ${P}_{{a}_{\mathrm{1}}}$ such that
$${\parallel {\widehat{P}}_{a}{P}_{a}\parallel}_{\mathrm{\infty}}\le \epsilon \text{\mathit{a}\mathit{n}\mathit{d}}{\parallel {\widehat{P}}_{{a}_{1}}{P}_{{a}_{1}}\parallel}_{\mathrm{\infty}}\le \epsilon $$ 
Then, $\mathrm{\forall}a\mathrm{,}{a}_{\mathrm{1}}\mathrm{\in}A$
$${\parallel ({\widehat{P}}_{{a}_{1}}{\widehat{P}}_{a}){(I\gamma {\widehat{P}}_{{a}_{1}})}^{1}({P}_{{a}_{1}}{P}_{a}){(I\gamma {P}_{{a}_{1}})}^{1}\parallel}_{\mathrm{\infty}}\le 2\epsilon \frac{(n1)\gamma +1}{{(1\gamma )}^{2}}$$ 
Note that this result is for each action. The concentration over all actions can be found by using the union bound over the set of actions.
An estimate of the value of $\epsilon $ when the estimation is done using $m$ samples can be shown using the DvoretzkyKieferWolfowitz inequality [1] to be on the order of $\epsilon \in O\left(\sqrt{\frac{2\mathrm{log}\frac{2n}{\delta}}{m}}\right)$.
This result is shown in the following Theorem 6.2.
Theorem 6.2.
Let ${P}_{a}$ be a $n\mathrm{\times}n$ right stochastic matrix for an action $a\mathrm{\in}A$ and let ${\widehat{P}}_{a}$ be an maximum likelihood estimate of ${P}_{a}$ formed from $m$ samples. If $m\mathrm{\ge}\frac{\mathrm{2}}{{\epsilon}^{\mathrm{2}}}\mathit{}\mathrm{log}\mathit{}\frac{\mathrm{2}\mathit{}n}{\delta}$, then we have
$$\mathbb{P}[{\parallel {\widehat{P}}_{a}{P}_{a}\parallel}_{\mathrm{\infty}}\le \epsilon ]\ge 1\delta $$ 
The theorem above assumes that it is possible to start in any given state. However, this may not always be the case. In this case, as long as every state is reachable from an initial state with probability at least $\alpha $, the result presented in Theorem 5.2 can be modified to use Theorem 6.3 instead of Theorem 6.2.
Theorem 6.3.
Let ${P}_{a}$ be a $n\mathrm{\times}n$ right stochastic matrix for an action $a\mathrm{\in}A$ and let ${\widehat{P}}_{a}$ be an maximum likelihood estimate of ${P}_{a}$ formed from $m$ samples. Let every state be reachable from the starting state in one step with probability at least $\alpha $. If $m\mathrm{\ge}\frac{\mathrm{4}}{\alpha \mathit{}{\epsilon}^{\mathrm{2}}}\mathit{}\mathrm{log}\mathit{}\frac{\mathrm{4}\mathit{}n\mathit{}k}{\delta}$ then
$$\mathbb{P}[{\parallel {\widehat{P}}_{a}{P}_{a}\parallel}_{\mathrm{\infty}}\le \epsilon ]\ge 1\delta ,\delta \in (0,1)\forall a\in A$$ 
7 Discussion
The result of Theorem 5.2 shows that the number of samples required to solve a $\beta $strict separable inverse reinforcement learning problem and obtain a reward that generates the desired optimal policy is on the order of $m\in O\left(\frac{{n}^{2}}{{\beta}^{2}}\mathrm{log}(nk)\right)$. Notice that the number of samples in inversely proportional to ${\beta}^{2}$. Thus by viewing the case of Regime 2 as $lim\beta \to 0$ of the $\beta $strict separable case (Regime 3), it is easy to see that an infinite number of samples are required to guarantee that the reward obtained will generate the optimal policy for the MDP with the true transition probability matrices.
In practical applications, however, it may be difficult to determine if an inverse reinforcement learning problem is $\beta $strict separable (Regime 3) or not. In this case, the result of equation (5.1) can be used as a witness to determine that the obtained $\widehat{R}$ satisfies Bellman’s optimality condition with respect to the true transition probability matrices with high probability as shown in the following remark.
Remark 7.1.
Let $\mathrm{\{}S\mathrm{,}A\mathrm{,}{P}_{a}\mathrm{,}\gamma \mathrm{\}}$ be an inverse reinforcement learning problem. Let every state be reachable from the starting state in one step with probability at least $\alpha $. Let $\widehat{R}$ be the solution to the optimization problem 4.1 with ${\widehat{F}}_{a\mathit{}i}$ with transition probability matrices ${\widehat{P}}_{a}$ that are maximum likelihood estimates of ${P}_{a}$ formed from $m$ samples and let
$$\epsilon =2\sqrt{\frac{4}{\alpha m}\mathrm{log}\frac{4nk}{\delta}}\cdot \frac{(n1)\gamma +1}{{(1\gamma )}^{2}}$$ 
If ${\mathrm{\parallel}\widehat{R}\mathrm{\parallel}}_{\mathrm{1}}\mathrm{\ll}\frac{\mathrm{1}}{\epsilon}$, then with probability at least $\mathrm{(}\mathrm{1}\mathrm{}\delta \mathrm{)}$, we have ${F}_{a\mathit{}i}^{T}\mathit{}\widehat{R}\mathrm{\ge}\mathrm{0}\mathit{}\mathrm{\forall}a\mathrm{\in}A\mathrm{\setminus}{a}_{\mathrm{1}}\mathrm{,}i\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}n$.
8 Experimental results
Experiments were performed using randomly generated transition probability matrices for $\beta $strictly separable MDPs with $n=5$ states, $k=5$ actions, $\gamma =0.1$ and with $n=7$ states, $k=7$ actions, $\gamma =0.1$. Both experiments were done with ${P}_{a1}$ as the optimal policy. Thirty randomly generated MDPs were considered in each case and a varying number of samples were used to find estimates of the transition probability matrices in each trial. Reward functions $\widehat{R}$ were found by solving Problem 4.1 for our L1regularized SVM formulation, and Problem 2.1 for the method of [5], using the same set of estimated transition probabilities, i.e., ${\widehat{F}}_{ai}$. The resulting reward functions were then tested using the true transition probabilities for ${F}_{ai}^{T}\widehat{R}\ge 0$. The percentage of trials for which ${F}_{ai}^{T}\widehat{R}\ge 0$ held true for both of the methods is shown in Figure 3 for different number of samples used. As prescribed by Theorem 5.2, for $\beta \approx 0.0032$, the sufficient number of samples for the success of our method is $O\left(\frac{{n}^{2}}{{\beta}^{2}}\mathrm{log}(nk)\right)$. As we can observe, the success rate increases with the number of samples as expected. The L1regularized support vector machine, however, significantly outperforms the linear programming formulation proposed in [5], reaching $100\%$ success shortly after the sufficient number of samples while the method proposed by [5] falls far behind. The result is that the reward function given by the L1regularized support vector machine formulation successfully generates the optimal policy $\pi ={a}_{1}$ in almost $100\%$ of the trials given $O\left(\frac{{n}^{2}}{{\beta}^{2}}\mathrm{log}(nk)\right)$ samples while the reward function estimated by the method presented in [5] fails to generate the desired optimal policy.
9 Concluding remarks
The L1regularized support vector formulation along with the geometric interpretation provide a useful way of looking at the inverse reinforcement learning problem with strong, formal guarantees. Possible future work on this problem includes extension to the inverse reinforcement learning problem with continuous states by using sets of basis functions as presented in [5].
References
 [1] J. Kiefer A. Dvoretzky and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.
 [2] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twentyfirst International Conference on Machine Learning, ICML ’04, pages 1–, New York, NY, USA, 2004. ACM.
 [3] Hadi Daneshmand, Manuel GomezRodriguez, Le Song, and Bernhard Scholkopf. Estimating diffusion network structures: Recovery conditions, sample complexity & softthresholding algorithm. In International Conference on Machine Learning, pages 793–801, 2014.
 [4] Han Liu, Larry Wasserman, John D Lafferty, and Pradeep K Ravikumar. Spam: Sparse additive models. In Advances in Neural Information Processing Systems, pages 1201–1208, 2008.
 [5] A. Y. Ng and S.J. Russel. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663 – 670, 2000.
 [6] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. Urbana, 51(61801):1–4, 2007.
 [7] Pradeep Ravikumar, Martin J Wainwright, John D Lafferty, et al. Highdimensional ising model selection using l1regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
 [8] Martin J Wainwright. Sharp thresholds for highdimensional and noisy sparsity recovery using l1constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
 [9] Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J Hastie. 1norm support vector machines. In Advances in neural information processing systems, pages 49–56, 2004.
Appendix: On the Sample Complexity of Inverse Reinforcement Learning
This appendix contains the proofs for various Lemmas and Theorems presented in the paper.
Appendix A Proofs of Lemmas and Theorems
A.1 Proof of Theorem 3.1
Proof.
The proof follows from the fact that the points $\{{F}_{ai}\}$ lie on one side of the hyperplane passing through the origin given by ${w}^{T}x=0$ if and only if
$${w}^{T}{F}_{ai}\ge 0\forall a\in A\setminus {a}_{1},i=1,\mathrm{\dots},n$$ 
or
$${w}^{T}{F}_{ai}\le 0\forall a\in A\setminus {a}_{1},i=1,\mathrm{\dots},n$$ 
The proof in the ’if’ direction follows by taking the hyperplane defined by $w=R$ and noticing that ${w}^{T}{F}_{ai}={R}^{T}{F}_{ai}={F}_{ai}^{T}R\ge 0$ so all the points $\{{F}_{ai}\}$ lie on one side of the hyperplane passing through the origin given by ${R}^{T}x=0$
The proof in the ’only if’ direction is as follows. Consider a separating hyperplane $w$. Without loss of generality,
$${w}^{T}{F}_{ai}\ge 0$$ 
Now let $R=w$ then ${F}_{ai}^{T}R={R}^{T}{F}_{ai}={w}^{T}{F}_{ai}\ge 0$ so $R=w$ generates the optimal policy $\pi ={a}_{1}$. ∎
A.2 Proof of Theorem 5.2
Proof.
The proof of this theorem is a consequence of Corollary 5.1 and Theorems 6.1 and 6.3. Note that from Theorem 6.3, we want the concentration to hold with probability $(1\delta )$ for all transition probability matrices corresponding to the set of actions. This can be viewed as the concentration inequality holding for a single $nk\times n$ matrix which gives us the result for $m$ samples
$$m\ge \frac{4}{\alpha {\epsilon}^{2}}\mathrm{log}\frac{4nk}{\delta}$$ 
$$ 
The result then follows from substituting this value of ${\epsilon}_{1}$ into the $\epsilon $ in Theorem 6.1 and the consequent result into Corollary 5.1. ∎
A.3 Proof of Lemma 6.1
Proof.
Let $C=AB$, we have ${c}_{ij}={\sum}_{k}{a}_{ik}{b}_{kj}$
$${\parallel AB\parallel}_{\mathrm{\infty}}=\parallel C\parallel {}_{\mathrm{\infty}}=sup{}_{i,j}c{}_{ij}$$ 
From Holder’s inequality we get
$\parallel AB\parallel {}_{\mathrm{\infty}}$  $=\underset{i,j}{sup}\left\{\left{\displaystyle \sum _{k}}{a}_{ik}{b}_{kj}\right\right\}$  
$\le \underset{i,j}{sup}\left\{\left{\displaystyle \sum _{k}}{a}_{ik}\right\right\}\underset{i,j}{sup}\left\{\left{b}_{ik}\right\right\}$  
$\le \underset{i,j}{sup}\left\{{\displaystyle \sum _{k}}\left{a}_{ik}\right\right\}\underset{i,j}{sup}\left\{\left{b}_{ik}\right\right\}$  
$={\leftA\right}_{\mathrm{\infty}}{\parallel B\parallel}_{\mathrm{\infty}}$ 
∎
A.4 Proof of Lemma 6.2
Proof.
First note that if $P$ is a right stochastic matrix then ${P}^{k}$ is a right stochastic matrix for all natural numbers $k$. Consider $n\times n$ right stochastic matrices $A,B,C,D$. Consider the expression ${\parallel ACBD\parallel}_{\mathrm{\infty}}$ From Lemma 1, we get,
${\parallel ACBD\parallel}_{\mathrm{\infty}}$  $={\parallel ACAD+ADBD\parallel}_{\mathrm{\infty}}$  
$\le {\parallel ACAD\parallel}_{\mathrm{\infty}}+{\parallel ADBD\parallel}_{\mathrm{\infty}}$  
$\le {\leftA\right}_{\mathrm{\infty}}{\parallel CD\parallel}_{\mathrm{\infty}}+{\leftAB\right}_{\mathrm{\infty}}{\parallel D\parallel}_{\mathrm{\infty}}$ 
Notice that ${\leftAB\right}_{\mathrm{\infty}}\le n{\parallel AB\parallel}_{\mathrm{\infty}}$ and ${\leftA\right}_{\mathrm{\infty}}=1$ and ${\parallel D\parallel}_{\mathrm{\infty}}\le 1$, thus we have
${\parallel ACBD\parallel}_{\mathrm{\infty}}$  $\le {\parallel CD\parallel}_{\mathrm{\infty}}+n{\parallel AB\parallel}_{\mathrm{\infty}}$ 
Now we will prove the lemma by induction $k=1$. We have
$${\parallel \widehat{P}P\parallel}_{\mathrm{\infty}}\le \epsilon =((11)n+1)\epsilon $$ 
Assume the statement for $k1$ is true. For $k>1$ we have
$${\parallel {\widehat{P}}^{(k1)}{P}^{(k1)}\parallel}_{\mathrm{\infty}}\le (((k1))1)n+1)\epsilon $$ 
Consider the previous result with $A=\widehat{P},B=P,C={\widehat{P}}^{(k1)},D={P}^{(k1)}$. Substituting, we get
$${\parallel \widehat{P}{\widehat{P}}^{(k1)}P{P}^{(k1)}\parallel}_{\mathrm{\infty}}\le (((k1))1)n+1)\epsilon +n\epsilon $$ 
$$\u27f9{\parallel {\widehat{P}}^{(k)}{P}^{(k)}\parallel}_{\mathrm{\infty}}\le ((k1)n+1)\epsilon $$ 
∎
A.5 Proof of Theorem 6.1
Proof.
Consider the expression from the theorem
$${\parallel ({\widehat{P}}_{{a}_{1}}{\widehat{P}}_{a}){(I\gamma {\widehat{P}}_{{a}_{1}})}^{1}({P}_{{a}_{1}}{P}_{a}){(I\gamma {P}_{{a}_{1}})}^{1}\parallel}_{\mathrm{\infty}}$$ 
$$={\parallel ({\widehat{P}}_{{a}_{1}}{\widehat{P}}_{a})\sum _{j=0}^{\mathrm{\infty}}{\left(\gamma {\widehat{P}}_{{a}_{1}}\right)}^{j}({P}_{{a}_{1}}{P}_{a})\sum _{j=0}^{\mathrm{\infty}}{\left(\gamma {P}_{{a}_{1}}\right)}^{j}\parallel}_{\mathrm{\infty}}$$ 
$$={\parallel ({\widehat{P}}_{{a}_{1}}{\widehat{P}}_{a})\sum _{j=0}^{\mathrm{\infty}}{\left(\gamma {\widehat{P}}_{{a}_{1}}\right)}^{j}{\widehat{P}}_{a}\sum _{j=0}^{\mathrm{\infty}}{\left(\gamma {P}_{{a}_{1}}\right)}^{j}+{\widehat{P}}_{a}\sum _{j=0}^{\mathrm{\infty}}{\left(\gamma {P}_{{a}_{1}}\right)}^{j}({P}_{{a}_{1}}{P}_{a})\sum _{j=0}^{\mathrm{\infty}}{\left(\gamma {P}_{{a}_{1}}\right)}^{j}\parallel}_{\mathrm{\infty}}$$ 
$$={\parallel \sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}\left({\widehat{P}}_{{a}_{1}}^{j+1}{P}_{{a}_{1}}^{j+1}\right)({\widehat{P}}_{a})\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}\left({\widehat{P}}_{{a}_{1}}^{j}{P}_{{a}_{1}}^{j}\right)({\widehat{P}}_{a}{P}_{a})\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}\left({P}_{{a}_{1}}^{j}\right)\parallel}_{\mathrm{\infty}}$$ 
$$\le \sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\parallel \left({\widehat{P}}_{{a}_{1}}^{j+1}{P}_{{a}_{1}}^{j+1}\right)\parallel}_{\mathrm{\infty}}+\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\parallel ({\widehat{P}}_{a})\left({\widehat{P}}_{{a}_{1}}^{j}{P}_{{a}_{1}}^{j}\right)\parallel}_{\mathrm{\infty}}+\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\parallel ({\widehat{P}}_{a}{P}_{a})\left({P}_{{a}_{1}}^{j}\right)\parallel}_{\mathrm{\infty}}$$ 
From Lemma 6.1 and Lemma 6.2; and the fact that for a right stochastic matrix $P$, ${\leftP\right}_{\mathrm{\infty}}=1$ and $\parallel P{\parallel}_{\mathrm{\infty}}\le 1$; we have
$$\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\parallel \left({\widehat{P}}_{{a}_{1}}^{j+1}{P}_{{a}_{1}}^{j+1}\right)\parallel}_{\mathrm{\infty}}+\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\parallel ({\widehat{P}}_{a})\left({\widehat{P}}_{{a}_{1}}^{j}{P}_{{a}_{1}}^{j}\right)\parallel}_{\mathrm{\infty}}+\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\parallel ({\widehat{P}}_{a}{P}_{a})\left({P}_{{a}_{1}}^{j}\right)\parallel}_{\mathrm{\infty}}$$ 
$$\le \sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\parallel \left({\widehat{P}}_{{a}_{1}}^{j+1}{P}_{{a}_{1}}^{j+1}\right)\parallel}_{\mathrm{\infty}}+\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\left{\widehat{P}}_{a}\right}_{\mathrm{\infty}}{\parallel \left({\widehat{P}}_{{a}_{1}}^{j}{P}_{{a}_{1}}^{j}\right)\parallel}_{\mathrm{\infty}}+\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\left{\widehat{P}}_{a}{P}_{a}\right}_{\mathrm{\infty}}{\parallel \left({P}_{{a}_{1}}^{j}\right)\parallel}_{\mathrm{\infty}}$$ 
$$\le \sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\parallel \left({\widehat{P}}_{{a}_{1}}^{j+1}{P}_{{a}_{1}}^{j+1}\right)\parallel}_{\mathrm{\infty}}+\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}{\parallel \left({\widehat{P}}_{{a}_{1}}^{j}{P}_{{a}_{1}}^{j}\right)\parallel}_{\mathrm{\infty}}+\sum _{j=0}^{\mathrm{\infty}}{\gamma}^{j}n{\parallel {\widehat{P}}_{a}{P}_{a}\parallel}_{\mathrm{\infty}}$$ 
$\le {\displaystyle \sum _{j=0}^{\mathrm{\infty}}}{\gamma}^{j}((j)n+1)\epsilon +{\displaystyle \sum _{j=0}^{\mathrm{\infty}}}{\gamma}^{j}((j1)n+1)\epsilon +{\displaystyle \sum _{j=0}^{\mathrm{\infty}}}{\gamma}^{j}n\epsilon $  
$=\epsilon {\displaystyle \sum _{j=0}^{\mathrm{\infty}}}{\gamma}^{j}\left((jn+1)+((j1)n+1)+n\right)$  
$=2n\epsilon {\displaystyle \sum _{j=0}^{\mathrm{\infty}}}j{\gamma}^{j}+2\epsilon {\displaystyle \sum _{j=0}^{\mathrm{\infty}}}{\gamma}^{j}$  
$=2\epsilon \left({\displaystyle \frac{n\gamma}{{(1\gamma )}^{2}}}+{\displaystyle \frac{1}{1\gamma}}\right)$  
$=2\epsilon {\displaystyle \frac{(n1)\gamma +1}{{(1\gamma )}^{2}}}$ 
∎
A.6 Proof of Theorem 6.2
Proof.
Here we invoke The DvoretzkyKieferWolfowitz inequality [1]. Consider $m$ samples of a random variable ${Y}_{ia}$ with domain $\{1,\mathrm{\dots},n\}$, let ${y}_{ia}^{(1)},\mathrm{\dots},{y}_{ia}^{(m)}\in \{1,\mathrm{\dots},n\}$ correspond to the observed resulting state under an action $a$ taken at a state $i$. Let ${\widehat{T}}_{ia}(s)=\frac{1}{m}{\sum}_{j=1}^{m}\mathrm{\U0001d7d9}[{y}_{ia}^{(j)}\le s]$ be an estimate of the CDF of ${Y}_{ia}$ and let ${T}_{ia}(s)=P[{Y}_{ia}\le s]$ be the actual CDF. From the DvoretzkyKieferWolfowitz inequality we have
$$\mathbb{P}(\underset{s\in \{1,\mathrm{\dots},n\}}{sup}{\widehat{T}}_{ia}(s){T}_{ia}(s)>\epsilon )\le 2{e}^{2m{\epsilon}^{2}}$$ 
$$\u27f9\mathbb{P}(\underset{s\in \{1,\mathrm{\dots},n\}}{sup}{\widehat{T}}_{ia}(s){T}_{ia}(s)\le \epsilon )>12{e}^{2m{\epsilon}^{2}}$$ 
Now consider the PDF of ${Y}_{ia}$ given by ${\widehat{\text{\mathbf{p}}}}_{ia}(s)={\widehat{T}}_{ia}(s){\widehat{T}}_{ia}(s1)$. Notice that
$\left{\widehat{\text{\mathbf{p}}}}_{ia}(s){\text{\mathbf{p}}}_{ia}(s)\right$  $\le \left\left({\widehat{T}}_{ia}(s){\widehat{T}}_{ia}(s1)\right)\left({T}_{ia}(s){T}_{ia}(s1)\right)\right$  
$\le \left{\widehat{T}}_{ia}(s){T}_{ia}(s)\right+\left{\widehat{T}}_{ia}(s1){T}_{ia}(s1)\right$ 
So if we have
$$\underset{s\in \{1,\mathrm{\dots},n\}}{sup}\left{\widehat{T}}_{ia}(s){T}_{ia}(s)\right\le \epsilon $$ 
then
$$\underset{s\in \{1,\mathrm{\dots},n\}}{sup}\left{\widehat{\text{\mathbf{p}}}}_{ia}(s){\text{\mathbf{p}}}_{ia}(s)\right\le 2\epsilon $$ 
$$\u27f9\mathbb{P}(\underset{s\in \{1,\mathrm{\dots},n\}}{sup}{\widehat{\text{\mathbf{p}}}}_{ia}(s){\text{\mathbf{p}}}_{ia}(s)\le \epsilon )>12{e}^{m{\epsilon}^{2}/2}$$ 
Here we can interpret ${\widehat{\text{\mathbf{p}}}}_{ia}(\cdot )$ and ${\text{\mathbf{p}}}_{ia}(\cdot )$ as the $i$th rows of the matrices ${\widehat{P}}_{a}$ and ${P}_{a}$ respectively. $\widehat{\text{\mathbf{p}}}({Y}_{ia})$, is the maximum likelihood estimator formed from $m$ samples. From application of the union bound over all rows of the matrix ${P}_{a}$, we have for $\epsilon >0$, and $m$ samples,
$$  $>12n{e}^{m{\epsilon}^{2}/2}$  
$\u27f9\mathbb{P}[{\parallel {\widehat{P}}_{a}{P}_{a}\parallel}_{\mathrm{\infty}}\le \epsilon ]\ge 1\delta ,\delta \in (0,1)$ 
if $m\ge \frac{2}{{\epsilon}^{2}}\mathrm{log}\frac{2n}{\delta}$ ∎
A.7 Proof of Theorem 6.3
Proof.
Without loss of generality, let every state $j=1,\mathrm{\dots},n$ be reachable from state $j=1$ by action ${a}_{1}$ after a step with probability at least $\alpha $. Let ${Y}_{ja}$ be a random variable domain $\{1,\mathrm{\dots},n\}$. Let ${Z}_{j}$ be a Bernoulli random variable such that $P({Z}_{j}=1)\ge \alpha \forall j$. Let $({z}_{j}^{(1)},{y}_{j}^{(1)}),\mathrm{\dots},({z}_{j}^{(m)},{y}_{j}^{(m)})$ be $m$ pairs of independent samples of ${Z}_{j}$ and ${Y}_{aj}$. Here ${Z}_{j}$ represents the state chain $1\stackrel{{a}_{1}}{\to}j\to \mathrm{\dots}$
Consider the event ${A}_{1}\equiv \{\frac{1}{m}{\sum}_{k=1}^{m}{z}_{j}^{(k)}\ge \alpha \u03f5\forall j\}$. By the onesided Hoeffding’s inequality and taking the union bound over all states we have
$$\mathbb{P}({A}_{1})\ge 1n{e}^{2{\u03f5}^{2}m}$$ 
We also have the conditional maximum likelihood probability estimator
$$\widehat{\text{\mathbf{p}}}({Y}_{j}=s{Z}_{j}=1)=\frac{1}{{\sum}_{k=1}^{m}{z}_{j}^{(k)}}\sum _{l=1}^{m}\mathrm{\U0001d7d9}[({y}_{j}^{(l)}=s)\wedge {z}_{j}^{(l)}]$$ 
From Theorem 6.2 we have for event
$${A}_{2}\equiv \{{\parallel \widehat{\text{\mathbf{p}}}({Y}_{ja}{Z}_{j}=1)\text{\mathbf{p}}({Y}_{ja}{Z}_{j}=1)\parallel}_{\mathrm{\infty}}\le \beta \}$$ 
$$\mathbb{P}({A}_{2}{A}_{1})\ge 12n{e}^{2{\beta}^{2}m(\alpha \u03f5)/2}$$ 
By the law of total probability
$\mathbb{P}({A}_{2})\ge \mathbb{P}({A}_{2},{A}_{1})$  $=\mathbb{P}({A}_{2}{A}_{1})\mathbb{P}({A}_{1})$  
$=\left(1n{e}^{2{\u03f5}^{2}m}\right)\left(12n{e}^{2{\beta}^{2}m(\alpha \u03f5)/2}\right)$  
$\ge 1n{e}^{2{\u03f5}^{2}m}2n{e}^{2{\beta}^{2}m(\alpha \u03f5)/2}$ 
By solving $\frac{\delta}{2}=n{e}^{2{\u03f5}^{2}m}$ and $\frac{\delta}{2}=2n{e}^{2{\beta}^{2}m(\alpha \u03f5)/2}$ we can see that if $m\ge \mathrm{max}\{\frac{1}{2{\u03f5}^{2}}\mathrm{log}\frac{2n}{\delta},\frac{2}{(\alpha \u03f5){\beta}^{2}}\mathrm{log}\frac{4n}{\delta}\}$ then $\mathbb{P}({A}_{2})\ge 1\delta $ Letting $\u03f5=\frac{\alpha}{2}$ and taking the union bound over all actions $a\in A$ we have if $m\ge \frac{4}{\alpha {\beta}^{2}}\mathrm{log}\frac{4nk}{\delta}$ then
$$\mathbb{P}[{\parallel {\widehat{P}}_{a}{P}_{a}\parallel}_{\mathrm{\infty}}\le \beta ]\ge 1\delta ,\delta \in (0,1)\forall a\in A$$ 
∎
Appendix B Experiment setup and additional results
Experiments were performed in MATLAB using randomly generated transition probability matrices for $\beta $strictly separable MDPs with $n$ states, $k$ actions, $\gamma $.
We generate the rows of the transition probability matrices individually in one of two different ways. In the first method each row ${P}_{a}(i)=x\in {[0,1]}^{n}$ is generated as a uniformly sampled point from the region $\mathcal{C}:=\{x\in {\mathbb{R}}^{n}\mid x\u2ab00,{\parallel x\parallel}_{1}=1\}$. In the second method, we wanted to simulate situations where taking an action at a state would lead to a few states (say ${n}_{1}$ states) with a large probability and the remaining $n{n}_{1}$ states with a small probability. This is similar to a situation where the action chosen is followed with a large probability and where state jumps to a random state with a small probability. This is based on the idea of following the action with a large probability and randomly "exploring" with a small probability. This is similar to transition rules used by [5] in their experiment. Both of these methods were tested in generating the "true" transition probability matrices. The results shown in Figure 4 of this supplement were obtained using transition probability matrices generated by the first method. The results presented in Figure 3 in the main paper and in Figure 5 of this supplement were obtained for transition probability matrices generated by the second method. We also checked all generated transition probability matrices to ensure $\beta $separability. The maximum likelihood estimates ${\widehat{P}}_{ai}$ of these transition probability matrices were formed by sampling trajectories under the true transition probability matrices with the action chosen uniformly at random at each state. Several trajectories were formed, each with a random initial state to ensure that each state was reachable in the simulations.
Recall that ${F}_{ai}=({P}_{{a}_{1}}(i){P}_{a}(i)){(I\gamma {P}_{{a}_{1}})}^{1}$ and ${\widehat{F}}_{ai}=({\widehat{P}}_{{a}_{1}}(i){\widehat{P}}_{a}(i)){(I\gamma {\widehat{P}}_{{a}_{1}})}^{1}$. Reward functions $\widehat{R}$ were found by solving our L1regularized SVM formulation, and the method of Ng & Russel (2000), using the same set of estimated transition probabilities, i.e., ${\widehat{F}}_{ai}$. The resulting reward functions were then tested using the true transition probabilities for ${F}_{ai}^{T}\widehat{R}\ge 0$.
Additional results for $30$ repetitions of $n=5$ states, $k=5$ actions, separability $\beta =0.0032$, with transition probabilities generated using the first method are shown in Figure 4. Results for $20$ repetitions of $n=10$ states, $k=10$ actions, separability $\beta =0.0032$, with transition probabilities generated using the second method are shown in Figure 5.