Abstract
Modern deep learning methods provide an effective means to learn goodrepresentations. However, is a good representation itself sufficient forefficient reinforcement learning? This question is largely unexplored, and theextant body of literature mainly focuses on conditions which permit efficientreinforcement learning with little understanding of what are necessaryconditions for efficient reinforcement learning. This work provides strongnegative results for reinforcement learning methods with function approximationfor which a good representation (feature extractor) is known to the agent,focusing on natural representational conditions relevant to valuebasedlearning and policybased learning. For valuebased learning, we show that evenif the agent has a highly accurate linear representation, the agent still needsto sample exponentially many trajectories in order to find a nearoptimalpolicy. For policybased learning, we show even if the agent's linearrepresentation is capable of perfectly representing the optimal policy, theagent still needs to sample exponentially many trajectories in order to find anearoptimal policy. These lower bounds highlight the fact that having a good (valuebased orpolicybased) representation in and of itself is insufficient for efficientreinforcement learning. In particular, these results provide new insights intowhy the existing provably efficient reinforcement learning methods rely onfurther assumptions, which are often modelbased in nature. Additionally, ourlower bounds imply exponential separations in the sample complexity between 1)valuebased learning with perfect representation and valuebased learning witha goodbutnotperfect representation, 2) valuebased learning and policybasedlearning, 3) policybased learning and supervised learning and 4) reinforcementlearning and imitation learning.
Quick Read (beta)
arrows
Modern deep learning methods provide an effective means to learn good representations. However, is a good representation itself sufficient for efficient reinforcement learning? This question is largely unexplored, and the extant body of literature mainly focuses on conditions which permit efficient reinforcement learning with little understanding of what are necessary conditions for efficient reinforcement learning. This work provides strong negative results for reinforcement learning methods with function approximation for which a good representation (feature extractor) is known to the agent, focusing on natural representational conditions relevant to valuebased learning and policybased learning. For valuebased learning, we show that even if the agent has a highly accurate linear representation, the agent still needs to sample exponentially many trajectories in order to find a nearoptimal policy. For policybased learning, we show even if the agent’s linear representation is capable of perfectly representing the optimal policy, the agent still needs to sample exponentially many trajectories in order to find a nearoptimal policy.
These lower bounds highlight the fact that having a good (valuebased or policybased) representation in and of itself is insufficient for efficient reinforcement learning. In particular, these results provide new insights into why the existing provably efficient reinforcement learning methods rely on further assumptions, which are often modelbased in nature. Additionally, our lower bounds imply exponential separations in the sample complexity between 1) valuebased learning with perfect representation and valuebased learning with a goodbutnotperfect representation, 2) valuebased learning and policybased learning, 3) policybased learning and supervised learning and 4) reinforcement learning and imitation learning.
1 Introduction
Modern reinforcement learning (RL) problems are often challenging due to the huge state space. To tackle this challenge, function approximation schemes are often employed to provide a compact representation, so that reinforcement learning can generalize across states. A common paradigm is to first use a feature extractor to transform the raw input to features (a succinct representation) and then apply a linear predictor on top of the features. Traditionally, the feature extractor is often handcrafted (sutton2018reinforcement), while more modern methods often train a deep neural network to extract features. The hope of this paradigm is that, if there exists a good low dimensional (linear) representation, then efficient reinforcement learning is possible.
Empirically, combining various RL function approximation algorithms with neural networks for feature extraction has lead to tremendous successes on various tasks (mnih2015human; schulman2015trust; schulman2017proximal). A major problem, however, is that these methods often require a large amount of samples to learn a good policy. For example, deep $Q$network requires millions of samples to solve certain Atari games (mnih2015human). Here, one may wonder if there are fundamental statistical limitations on such methods, and, if so, under what conditions it would be possible to efficiently learn a good policy?
In the supervised learning context, it is wellknown that empirical risk minimization is a statistically efficient method when using a lowcomplexity hypothesis space (shalev2014understanding), e.g. a hypothesis space with bounded VC dimension. For example, a polynomial number of samples suffice for learning a nearoptimal $d$dimensional linear classifier, even in the agnostic setting^{1}^{1} 1 Here we only study the sample complexity and ignore the computational complexity.. In contrast, in the more challenging RL setting, we seek to understand if efficient learning is possible (say from a sample complexity perspective) when we have access to an accurate (and compact) parametric representation — e.g. our policy class contains a nearoptimal policy or our value function hypothesis class accurately approximates the true value functions. In particular, this work focuses on the following question:
Is a good representation sufficient for sampleefficient reinforcement learning?
This question is largely unexplored, where the extant body of literature mainly focuses on conditions which are sufficient for efficient reinforcement learning though there is little understanding of what are necessary conditions for efficient reinforcement learning. The challenge in reinforcement learning is that it is not evident how agents can leverage the given representation to efficiently find a nearoptimal policy for reasons related to the explorationexploitation tradeoff; there is no direct analogue of empirical risk minimization in the reinforcement learning context.
Many recent works have provided polynomial upper bounds under various sufficient conditions, and in what follows we list a few examples. For valuebased learning, the work of wen2013efficient showed that for deterministic systems^{2}^{2} 2 MDPs where both reward and transition are deterministic., if the optimal $Q$function can be perfectly predicted by linear functions of the given features, then the agent can learn the optimal policy exactly with a polynomial number of samples. Recent work (jiang2017contextual) further showed that if the Bellman rank, a certain complexity measure, is bounded, then the agent can learn a nearoptimal policy efficiently. For policybased learning, agarwal2019optimality gave polynomial upper bounds which depend on a parameter that measures the difference between the initial distribution and the distribution induced by the optimal policy.
Our contributions. This work gives, perhaps surprisingly, strong negative results to this question. The main results are exponential sample complexity lower bounds in terms of planning horizon $H$ for valuebased and policybased algorithms with given good representations^{3}^{3} 3 Our results can be easily extend to infinite horizon MDPs with discount factors by replacing the planning horizon $H$ with $\frac{1}{1\gamma}$, where $\gamma $ is the discount factor. We omit the discussion on discount MDPs for simplicity. . A summary of previous upper bounds and along with our new lower bounds is provided in Table 1. These lower bounds include:

1.
For valuebased learning, we show even if the $Q$functions of all policies can be approximated, in a worst case sense, by linear functions of the given representation with approximation error $\delta =\mathrm{\Omega}\left(\sqrt{\frac{H}{d}}\right)$ where $d$ is the dimension of the representation and $H$ is the planning horizon, then the agent still needs to sample exponential number of trajectories to find a nearoptimal policy.

2.
We show even if the optimal policy can be perfectly predicted by a linear function of the given representation with a strictly positive margin, the agent still requires exponential number of trajectories to find a nearoptimal policy.
These lower bounds hold even in deterministic systems and even if the agent knows the transition model. Furthermore, these negative results also apply to the case where ${Q}^{*}$, the optimal stateaction value, can be accurately approximated by a linear function. Since the class of linear functions is a strict subset of many more complicated function classes, including neural networks in particular, our negative results imply lower bounds for these more complex function classes as well.
Our results highlight a few conceptual insights:

•
Efficient RL may require the representation to encode model information (transition and reward). Under (implicit) modelbased assumptions, there exist upper bounds that can tolerate approximation error (jiang2017contextual; yang2019sample; sun2019model).

•
Since our lower bounds apply even when the agent knows the transition model, the hardness is not due to the difficulty of exploration in the standard sense. The unknown reward function is sufficient to make the problem exponentially difficult.

•
Our lower bounds are not due to the agent’s inability to perform efficient supervised learning, since our assumptions do admit polynomial sample complexity upper bounds if the data distribution is fixed.

•
Our lower bounds are not pathological in nature and suggest that these concerns may arise in practice. In a precise sense, almost all feature extractors induce a hard MDP instance in our construction (see Section 4.3).
Instead, one interpretation is that the hardness is due to a distribution mismatch in the following sense: the agent does not know which distribution to use for minimizing a (supervised) learning error (see kakade2003sample for discussion), and even a known transition model is not informationtheoretically sufficient to reduce the sample complexity.
Furthermore, our work implies several exponential separations on the sample complexity between: 1) valuebased learning with a perfect representation and valuebased learning with a goodbutnotperfect representation, 2) valuebased learning and policybased learning, 3) policybased learning and supervised learning and 4) reinforcement learning and imitation learning. We provide more details in Section 6.
Query Oracle  RL  Generative Model  Known Transition 
Previous Upper Bounds  
Exact Linear ${Q}^{*}$ + DetMDP (wen2013efficient)  ✓  ✓  ✓ 
Exact Linear ${Q}^{*}$ + BellmanRank (jiang2017contextual)  ✓  ✓  ✓ 
Exact Linear ${Q}^{*}$ + Low Var + Gap (du2019provably)  ✓  ✓  ✓ 
Exact Linear ${Q}^{*}$ + Gap (Open Problem / Theorem B.1)  ?  ✓  ✓ 
Exact Linear ${Q}^{\pi}$ for all $\pi $ (Open Problem / Theorem C.1)  ?  ✓  ✓ 
\makecell Approx. Linear ${Q}^{\pi}$ for all $\pi $ +  
Bounded Conc. Coeff. (munos2005error; antos2008learning)  ✓✗  ✓  ✓ 
\makecellApprox. Linear ${Q}^{\pi}$ for all $\pi $ +  
Bounded Dist. Mismatch Coeff. (agarwal2019optimality)  ✓✗  ✓  ✓ 
Lower Bounds (this work)  
Approx. Linear ${Q}^{*}$+ DetMDP (Theorem 4.1)  ✗  ✗  ✗ 
Approx. Linear ${Q}^{\pi}$ for all $\pi $ + DetMDP(Theorem 4.1)  ✗  ✗  ✗ 
Exact Linear ${\pi}^{*}$ + Margin + Gap + DetMDP (Theorem 4.2)  ✗  ✗  ✗ 
Exact Linear ${Q}^{*}$ (Open Problem)  ?  ?  ? 
2 Related Work
A summary of previous upper bounds, together with lower bounds proved in this work, is provided in Table 1. Some key assumptions are formally stated in Section 3 and Section 4. Our lower bounds highlight that classical complexity measures in supervised learning including small approximation error and margin, and standard assumptions in reinforcement learning including optimality gap and deterministic systems, are not enough for efficient RL with function approximation. We need additional assumptions, e.g., ones used in previous upper bounds, for efficient RL.
2.1 Previous Lower Bounds
Existing exponential lower bounds, to our knowledge, construct unstructured MDPs with an exponentially large state space and reduce a bandit problem with exponentially many arms to an MDP (krishnamurthy2016pac; sun2017deeply). However, these lower bounds do not immediately apply to MDPs whose transition models, value functions, or policies can be approximated with some natural function classes, e.g., linear functions, neural networks, etc. The current work gives the first set of lower bounds for RL with linear function approximation (and thus also hold for superclasses of linear functions like neural networks).
2.2 Previous Upper Bounds
We divide previous algorithms (with provable guarantees) into three classes: those that utilize uncertaintybased bonuses (e.g. UCB variants or Thompson sampling variants); approximate dynamic programming variants; and direct policy searchbased methods (such as Conserve Policy Iteration (CPI) (kakade2003sample)) or policy gradient methods. The first class of methods include those based on witness rank, Belman rank, and the Eluder dimension, while the latter two classes of algorithms make assumptions either on concentrability coefficients or on distribution mismatch coefficients (see agarwal2019optimality; Scherrer:API for discussions).
Uncertainty bonusbased algorithms. Now we discuss existing theoretical results on valuebased learning with function approximation. wen2013efficient showed that in deterministic systems, if the optimal $Q$function is within a prespecified function class which has bounded Eluder dimension (for which the class of linear functions is a special case), then the agent can learn the optimal policy using a polynomial number of samples. This result has recently been generalized by du2019provably which can deal with stochastic reward and low variance transition but requires strictly positive optimality gap. As we listed in Table 1, it is an open problem whether the condition that the optimal $Q$function is linear itself is sufficient for efficient RL.
li2011knows proposed a $Q$learning algorithm which requires the KnowWhatItKnows oracle. However, it is in general unknown how to implement such oracle in practice. jiang2017contextual proposed the concept of Bellman Rank to characterize the sample complexity of valuebased learning methods and gave an algorithm that has polynomial sample complexity in terms of the Bellman Rank, though the proposed algorithm is not computationally efficient. Bellman rank is bounded for a wide range of problems, including MDP with small number of hidden states, linear MDP, LQR, etc. Later work gave computationally efficient algorithms for certain special cases (dann2018polynomial; du2019provably; yang2019sample; jin2019provably). Recently, Witness rank, a generalization of Bellman rank to modelbased methods, is studied in sun2019model.
Approximate dynamic programmingbased algorithms. We now discuss approximate dynamic programmingbased results characterized in terms of the concentrability coefficient. While classical approximate dynamic programming results typically require ${\mathrm{\ell}}_{\mathrm{\infty}}$bounded errors, the notion of concentrability (originally due to (munos2005error)) permits sharper bounds in terms of average case function approximation error, provided that the concentrability coefficient is bounded (e.g. see munos2005error; szepesvari2005finite; antos2008learning; geist2019theory). Under the assumption that this problemdependent parameter is bounded, munos2005error; szepesvari2005finite and antos2008learning provided sample complexity and error bounds for approximate dynamic programming methods when there is a data collection policy (under which valuefunction fitting occurs) that induces a finite concentrability coefficient. The assumption that the concentrability coefficient is finite is in fact quite limiting. See chen2019information for a more detailed discussion on this quantity.
Direct policy searchbased algorithms. Stronger guarantees over approximate dynamic programmingbased algorithms can be obtained with direct policy searchbased methods, where instead of having a bounded concentrability coefficient, one only needs to have a bounded distribution mismatch coefficient. The latter assumption requires the agent to have access to a “good” initial state distribution (e.g. a measure which has coverage over where an optimal policy tends to visit); note that this assumption does not make restrictions over the class of MDPs. There are two classes of algorithms that fall into this category. First, there is Conservative Policy Iteration (kakade2002approximately), along with Policy Search by Dynamic Programming (PSDP) (NIPS2003_2378), and other boostingstyle of policy searchbased methods scherrer2014local; Scherrer:API, which have guarantees in terms of bounded distribution mismatch ratio. Second, more recently, agarwal2019optimality showed that policy gradient styles of algorithms also have comparable guarantees; the results also directly imply the learnability results for the “Approx. Linear ${Q}^{\pi}$ for all $\pi $” row in Table 1. Similar guarantees can be obtained with CPI (and its variants) with comparable assumptions.
3 Preliminaries
Throughout this paper, for a given integer $H$, we use $[H]$ to denote the set $\{0,1,\mathrm{\dots},H1\}$.
3.1 Episodic Reinforcement Learning
Let $\mathcal{M}=(\mathcal{S},\mathcal{A},H,P,R)$ be an Markov Decision Process (MDP) where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space whose size is bounded by a constant, $H\in {\mathbb{Z}}_{+}$ is the planning horizon, $P:\mathcal{S}\times \mathcal{A}\to \mathrm{\u25b3}\left(\mathcal{S}\right)$ is the transition function which takes a stateaction pair and returns a distribution over states and $R:\mathcal{S}\times \mathcal{A}\to \mathrm{\u25b3}\left(\mathbb{R}\right)$ is the reward distribution. Without loss of generality, we assume a fixed initial state ${s}_{0}$^{4}^{4} 4 Some papers assume the initial state is sampled from a distribution ${P}_{1}$. Note this is equivalent to assuming a fixed initial state ${s}_{0}$, by setting $P({s}_{0},a)={P}_{1}$ for all $a\in \mathcal{A}$ and now our state ${s}_{1}$ is equivalent to the initial state in their assumption.. A policy $\pi :\mathcal{S}\to \mathrm{\u25b3}(\mathcal{A})$ prescribes a distribution over actions for each state. The policy $\pi $ induces a (random) trajectory ${s}_{0},{a}_{0},{r}_{0},{s}_{1},{a}_{1},{r}_{1},\mathrm{\dots},{s}_{H1},{a}_{H1},{r}_{H1}$ where ${a}_{0}\sim \pi ({s}_{0})$, ${r}_{0}\sim R({s}_{0},{a}_{0})$, ${s}_{1}\sim P({s}_{0},{a}_{0})$, ${a}_{1}\sim \pi ({s}_{1})$, etc. To streamline our analysis, for each $h\in [H]$, we use ${\mathcal{S}}_{h}\subseteq \mathcal{S}$ to denote the set of states at level $h$, and we assume ${\mathcal{S}}_{h}$ do not intersect with each other. We also assume ${\sum}_{h=0}^{H1}{r}_{h}\in [0,1]$ almost surely. Our goal is to find a policy $\pi $ that maximizes the expected total reward $\mathbb{E}\left[{\sum}_{h=0}^{H1}{r}_{h}\mid \pi \right]$. We use ${\pi}^{*}$ to denote the optimal policy. We say a policy $\pi $ is $\epsilon $optimal if $\mathbb{E}\left[{\sum}_{h=0}^{H1}{r}_{h}\mid \pi \right]\ge \mathbb{E}\left[{\sum}_{h=0}^{H1}{r}_{h}\mid {\pi}^{*}\right]\epsilon $.
In this paper we prove lower bounds for deterministic systems, i.e., MDPs with deterministic transition $P$, deterministic reward $R$. In this setting, $P$ and $R$ can be regarded as functions instead of distributions. Since deterministic systems are special cases of general stochastic MDPs, lower bounds proved in this paper still hold for more general MDPs.
3.2 $Q$function, $V$function and Optimality Gap
An important concept in RL is the $Q$function. Given a policy $\pi $, a level $h\in [H]$ and a stateaction pair $(s,a)\in {\mathcal{S}}_{h}\times \mathcal{A}$, the $Q$function is defined as ${Q}_{h}^{\pi}(s,a)=\mathbb{E}[{\sum}_{{h}^{\prime}=h}^{H1}{r}_{{h}^{\prime}}\mid {s}_{h}=s,{a}_{h}=a,\pi ]$. For simplicity, we denote ${Q}_{h}^{*}(s,a)={Q}_{h}^{{\pi}^{*}}(s,a)$. It will also be useful to define the value function of a given state $s\in {\mathcal{S}}_{h}$ as ${V}_{h}^{\pi}(s)=\mathbb{E}[{\sum}_{{h}^{\prime}=h}^{H1}{r}_{{h}^{\prime}}\mid {s}_{h}=s,\pi ]$. For simplicity, we denote ${V}_{h}^{*}(s)={V}_{h}^{{\pi}^{*}}(s)$. Throughout the paper, for the $Q$function ${Q}_{h}^{\pi}$ and ${Q}_{h}^{*}$ and the value function ${V}_{h}^{\pi}$ and ${V}_{h}^{*}$, we may omit $h$ from the subscript when it is clear from the context.
In addition to these definitions, we list below an important assumption, the optimality gap assumption, which is widely used in reinforcement learning and bandit literature. To state the assumption, we first define the function $\mathrm{gap}:\mathcal{S}\times \mathcal{A}\to \mathbb{R}$ as $\mathrm{gap}(s,a)={\mathrm{max}}_{{a}^{\prime}\in \mathcal{A}}{Q}^{*}(s,{a}^{\prime}){Q}^{*}(s,a)$. Now we formally state the assumption.
Assumption 3.1 (Optimality Gap).
There exists $\rho \mathrm{>}\mathrm{0}$ such that $\rho \mathrm{\le}\mathrm{gap}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$ for all $\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\in}\mathrm{S}\mathrm{\times}\mathrm{A}$ with $\mathrm{gap}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{>}\mathrm{0}$.
Here, $\rho $ is the smallest rewardtogo difference between the best set of actions and the rest. Recently, du2019qlearning gave a provably efficient $Q$learning algorithm based on this assumption and simchowitz2019non showed that with this condition, the agent only incurs logarithmic regret in the tabular setting.
3.3 Query Models
Here we discuss three possible query oracles interacting with the MDP.

•
RL: The most basic and weakest query oracle for MDP is the standard reinforcement learning query oracle where the agent can only interact with the MDP by choosing actions and observe the next state and the reward.

•
Generative Model: A stronger query model assumes the agent can transit to any state (kearns2002near; kakade2003sample; sidford2018near). This query model is available in certain robotic applications where one can control the robot to reach the target state.

•
Known Transition: The strongest query model considered is that the agent can not only transit to any state, but it also knows the whole transition. In this model, only the reward is unknown.
In this paper, we will prove lower bounds for the strongest Known Transition query oracle. Therefore, our lower bounds also apply to RL and Generative Model query oracles.
4 Main Results
In this section we formally present our lower bounds. We also discuss proof ideas in Section 4.3.
4.1 Lower Bound for Valuebased Learning
We first present our lower bound for valuebased learning. A common assumption is that the $Q$function can be predicted well by a linear function of the given features (representation) (bertsekas1996neuro). Formally, the agent is given a feature extractor $\varphi :\mathcal{S}\times \mathcal{A}\to {\mathbb{R}}^{d}$ which can be handcrafted or a pretrained neural network that transforms a stateaction pair to a $d$dimensional embedding. The following assumption states that the given feature extractor can be used to predict the $Q$function with approximation error at most $\delta $ using a linear function.
Assumption 4.1 (${Q}^{*}$ Realizability).
There exists $\delta \mathrm{>}\mathrm{0}$ and ${\theta}_{\mathrm{0}}\mathrm{,}{\theta}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\theta}_{H\mathrm{}\mathrm{1}}\mathrm{\in}{\mathrm{R}}^{d}$ such that for any $h\mathrm{\in}\mathrm{[}H\mathrm{]}$ and any $\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\in}{\mathrm{S}}_{h}\mathrm{\times}\mathrm{A}$, $\mathrm{\left}{Q}_{h}^{\mathrm{*}}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{}\mathrm{\u27e8}{\theta}_{h}\mathrm{,}\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\u27e9}\mathrm{\right}\mathrm{\le}\delta \mathrm{.}$
Here $\delta $ is the approximation error, which indicates the quality of the representation. If $\delta =0$, then $Q$function can be perfectly predicted by a linear function of $\varphi (\cdot ,\cdot )$. In general, $\delta $ becomes smaller as we increase the dimension of $\varphi $, since larger dimension usually has more expressive power. When the feature extractor is strong enough, previous papers (chen2019information; farahmand2011regularization) assume that linear functions of $\varphi $ can approximate the $Q$function of any policy.
Assumption 4.2 (Value Completeness).
There exists $\delta \mathrm{>}\mathrm{0}$, such that for any $h\mathrm{\in}\mathrm{[}H\mathrm{]}$ and any policy $\pi $, there exists ${\theta}_{h}^{\pi}\mathrm{\in}{\mathrm{R}}^{d}$ such that for any $\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\in}{\mathrm{S}}_{h}\mathrm{\times}\mathrm{A}$, $\mathrm{\left}{Q}_{h}^{\pi}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{}\mathrm{\u27e8}{\theta}_{h}\mathrm{,}\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\u27e9}\mathrm{\right}\mathrm{\le}\delta \mathrm{.}$
In the theoretical reinforcement learning literature, Assumption 4.2 is often called the (approximate) policy completeness assumption. This assumption is crucial in proving polynomial sample complexity guarantee for value iteration type of algorithms (chen2019information; farahmand2011regularization).
The following theorem shows when $\delta =\mathrm{\Omega}\left(\sqrt{\frac{H}{d}}\right)$, the agent needs to sample exponential number of trajectories to find a nearoptimal policy.
Theorem 4.1 (Exponential Lower Bound for Valuebased Learning).
There exists a family of MDPs with $\mathrm{}\mathrm{A}\mathrm{}\mathrm{=}\mathrm{2}$ and a feature extractor $\varphi $ that satisfy Assumption 4.2, such that any algorithm that returns a $\mathrm{1}\mathrm{/}\mathrm{2}$optimal policy with probability $\mathrm{0.9}$ needs to sample $\mathrm{\Omega}\mathit{}\mathrm{\left(}\mathrm{min}\mathit{}\mathrm{\{}\mathrm{}\mathrm{S}\mathrm{}\mathrm{,}{\mathrm{2}}^{H}\mathrm{,}\mathrm{exp}\mathit{}\mathrm{(}d\mathit{}{\delta}^{\mathrm{2}}\mathrm{/}\mathrm{16}\mathrm{)}\mathrm{\}}\mathrm{\right)}$ trajectories.
Note this lower bound also applies to MDPs that satisfy Assumption 4.1, since Assumption 4.2 is a strictly stronger assumption. We would like to emphasize that since linear functions is a subclass of more complicated function classes, e.g., neural networks, our lower bound also holds for these function classes. Moreover, the assumption that $\mathcal{A}=2$ is only for simplicity. Our lower bound can be easily generalized to the case that $\mathcal{A}>2$, in which case the sample complexity lower bound is $\mathrm{\Omega}\left(\mathrm{min}\{\mathcal{S},{\mathcal{A}}^{H},\mathrm{exp}(d{\delta}^{2}/16)\}\right)$.
4.2 Lower Bound for Policybased Learning
Next we present our lower bound for policybased learning. This class of methods use function approximation on the policy and use optimization techniques, e.g., policy gradient, to find the optimal policy. In this paper, we focus on linear policies on top of a given representation. A linear policy $\pi $ is a policy of the form $\pi ({s}_{h})=\mathrm{arg}{\mathrm{max}}_{a\in \mathcal{A}}\u27e8{\theta}_{h},\varphi ({s}_{h},a)\u27e9$ where ${s}_{h}\in {\mathcal{S}}_{h}$, $\varphi (\cdot ,\cdot )$ is a given feature extractor and ${\theta}_{h}\in {\mathbb{R}}^{d}$ is the linear coefficient. Note that applying policy gradient on softmax parameterization of the policy is indeed trying to find the optimal policy among linear policies.
Similar to valuebased learning, a natural assumption for policybased learning is that the optimal policy is realizable^{5}^{5} 5 Unlike valuebased learning, it is hard to define completeness on the policybased learning with function approximation, since not all policy has the $\mathrm{arg}\mathrm{max}$ form. .
Assumption 4.3 (${\pi}^{*}$ Realizability).
For any $h\mathrm{\in}\mathrm{[}H\mathrm{]}$, there exists ${\theta}_{h}\mathrm{\in}{\mathrm{R}}^{d}$ that satisfies for any $s\mathrm{\in}{\mathrm{S}}_{h}$, we have ${\pi}^{\mathrm{*}}\mathit{}\mathrm{\left(}s\mathrm{\right)}\mathrm{\in}{\mathrm{arg}\mathit{}\mathrm{max}}_{a}\mathit{}\mathrm{\u27e8}{\theta}_{h}\mathrm{,}\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\u27e9}\mathrm{.}$
Here we discuss another assumption. For learning a linear classifier in the supervised learning setting, one can reduce the sample complexity significantly if the optimal linear classifier has a margin.
Assumption 4.4 (${\pi}^{*}$ Realizability + Margin).
We assume $\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\in}{\mathrm{R}}^{d}$ satisfies ${\mathrm{\parallel}\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\parallel}}_{\mathrm{2}}\mathrm{=}\mathrm{1}$ for any $\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\in}\mathrm{S}\mathrm{\times}\mathrm{A}$. For any $h\mathrm{\in}\mathrm{[}H\mathrm{]}$, there exists ${\theta}_{h}\mathrm{\in}{\mathrm{R}}^{d}$ with ${\mathrm{\parallel}{\theta}_{h}\mathrm{\parallel}}_{\mathrm{2}}\mathrm{=}\mathrm{1}$ and $\mathrm{\u25b3}\mathrm{>}\mathrm{0}$ such that for any $s\mathrm{\in}{\mathrm{S}}_{h}$, there is a unique optimal action ${\pi}^{\mathrm{*}}\mathit{}\mathrm{(}s\mathrm{)}$, and for any $a\mathrm{\ne}{\pi}^{\mathrm{*}}\mathit{}\mathrm{(}s\mathrm{)}$, $\mathrm{\u27e8}{\theta}_{h}\mathrm{,}\varphi \mathit{}\mathrm{(}s\mathrm{,}{\pi}^{\mathrm{*}}\mathit{}\mathrm{(}s\mathrm{)}\mathrm{)}\mathrm{\u27e9}\mathrm{}\mathrm{\u27e8}{\theta}_{h}\mathrm{,}\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\u27e9}\mathrm{\ge}\mathrm{\u25b3}$.
Here we restrict the linear coefficients and features to have unit norm for normalization. Note that Assumption 4.4 is strictly stronger than Assumption 4.3. Now we present our result for linear policy.
Theorem 4.2 (Exponential Lower Bound for Policybased Learning).
There exists an absolute constant ${\mathrm{\u25b3}}_{\mathrm{0}}$, such that for any $\mathrm{\u25b3}\mathrm{\le}{\mathrm{\u25b3}}_{\mathrm{0}}$, there exists a family of MDPs and a feature extractor $\varphi $ that satisfy Assumption 3.1 with $\rho \mathrm{=}\frac{\mathrm{1}}{\mathrm{2}\mathit{}\mathrm{min}\mathit{}\mathrm{\{}H\mathrm{,}d\mathrm{\}}}$ and Assumption 4.4, such that any algorithm that returns a $\mathrm{1}\mathrm{/}\mathrm{4}$optimal policy with probability at least $\mathrm{0.9}$ needs to sample $\mathrm{\Omega}\mathit{}\mathrm{\left(}\mathrm{min}\mathit{}\mathrm{\{}{\mathrm{2}}^{H}\mathrm{,}{\mathrm{2}}^{d}\mathrm{\}}\mathrm{\right)}$ trajectories.
Again, our lower bound can be easily generalized to the case that $\mathcal{A}>2$.
Compared with Theorem 4.1, Theorem 4.2 is even more pessimistic, in the sense that even with perfect representation with benign properties (gap and margin), the agent still needs to sample exponential number of samples. It also suggests that policybased learning could be very different from supervised learning.
4.3 Proof Ideas
The binary tree hard instance.
All our lower bound are proved based on reductions from the following hard instance. In this instance, both the transition $P$ and the reward $R$ are deterministic, and there are two actions ${a}_{1}$ and ${a}_{2}$. There are $H$ levels of states, which form a full binary tree of depth $H$. Playing action ${a}_{1}$ transits a state to its left child while playing action ${a}_{2}$ transits a state to its right child. There are ${2}^{h}$ states in level $h$, and thus ${2}^{H}1$ states in total. Among all the ${2}^{H1}$ states in level $H1$, there is only one state with reward $R=1$, and for all other states in the MDP, the corresponding reward value $R=0$. Intuitively, to find a $1/2$optimal policy for such MDPs, the agent must enumerate all possible states in level $H1$ to find the state with reward $R=1$. Doing so intrinsically induces a sample complexity of $\mathrm{\Omega}({2}^{H})$. This intuition is formalized in Theorem 5.1 using Yao’s minimax principle (yao1977probabilistic).
Lower bound for valuebased learning
We now show how to construct a set of features so that Assumption 4.14.2 hold. Our main idea is to the utilize the following fact regarding the identity matrix: $\epsilon $$\mathrm{rank}({I}_{{2}^{H}})\le O(H/{\epsilon}^{2})$. Here for a matrix $A\in {\mathbb{R}}^{n\times n}$, its $\epsilon $$\mathrm{rank}$ (a.k.a approximate rank) is defined to be $\mathrm{min}\{\mathrm{rank}(B):B\in {\mathbb{R}}^{n\times n},{\parallel AB\parallel}_{\mathrm{\infty}}\le \epsilon \}$, where we use $\parallel \cdot {\parallel}_{\mathrm{\infty}}$ to denote the entrywise ${\mathrm{\ell}}_{\mathrm{\infty}}$ norm of a matrix. The upper bound $\epsilon $$\mathrm{rank}({I}_{n})\le O(\mathrm{log}n/{\epsilon}^{2})$ was first proved in alon2009perturbed using the JohnsonLindenstrauss Lemma (johnson1984extensions), and we also provide a proof in Lemma 5.1. The concept of $\epsilon $$\mathrm{rank}$ has wide applications in theoretical computer science (alon2009perturbed; barak2011rank; alon2013approximate; alon2014cover; chen2019classical), but to our knowledge, this is the first time that it appears in reinforcement learning.
This fact can be alternatively stated as follow: there exists $\mathrm{\Phi}\in {\mathbb{R}}^{{2}^{H}\times O(H/{\epsilon}^{2})}$ such that ${\parallel {I}_{{2}^{H}}\mathrm{\Phi}{\mathrm{\Phi}}^{\top}\parallel}_{\mathrm{\infty}}\le \epsilon $. We interpret each row of $\mathrm{\Phi}$ as the feature of a state in the binary tree. By construction of $\mathrm{\Phi}$, now features of states in the binary tree have a nice property that (i) each feature vector has approximately unit norm and (ii) different feature vector are nearly orthogonal. Using this set of features, we can now show that Assumption 4.1 and 4.2 hold. Here we prove Assumption 4.1 holds as an example and prove other assumptions also hold in the formal proof. To prove Assumption 4.1, we note that in the binary tree hard instance, for each level $h$, only a single state satisfies ${Q}^{*}=1$, and all other states satisfy ${Q}^{*}=0$. We simply take ${\theta}_{h}$ to be the feature of the state with ${Q}^{*}=1$. Since all feature vectors are nearly orthogonal, Assumption 4.1 holds.
Since the above fact regarding the $\epsilon $$\mathrm{rank}$ of the identity matrix can be proved by simply taking each row of $\mathrm{\Phi}$ to be a random unit vector, our lower bound reveals another intriguing (yet pessimistic) aspect of Assumption 4.1 and 4.2: for the binary tree instance, almost all feature extractors induce a hard MDP instance. This again suggests that a good representation itself may not necessarily lead to efficient RL and additional assumptions (e.g. on the reward distribution) could be crucial.
Lower bound for policybased learning.
It is straightfoward to construct a set of feature vectors for the binary tree instance so that Assumption 4.3 holds, even if $d=1$. We set $\varphi (s,a)$ to be $+1$ if $a={a}_{1}$ and $1$ if $a={a}_{2}$. For each level $h$, for the unique state $s$ in level $h$ with ${Q}^{*}=1$, we set ${\theta}_{h}$ to be $1$ if ${\pi}^{*}(s)={a}_{1}$ and $1$ if ${\pi}^{*}(s)={a}_{2}$. With this construction, Assumption 4.3 holds.
To prove that the lower bound under Assumption 4.4, we use a new reward function for states in level $H1$ in the binary tree instance above so that there exists a unique optimal action for each state in the MDP. See Figure 1 for an example with $H=3$ levels of states. Another nice property of the new reward function is that for all states $s$ we always have ${\pi}^{*}(s)={a}_{1}$. Now, we define ${2}^{H1}$ different new MDPs as follow: for each state in level $H1$, we change its original reward (defined in Figure 1) to $1$. An exponential sample complexity lower bound for these MDPs can be proved using the same argument as the original binary tree hard instance, and now we show this set of MDPs satisfy Assumption 4.4. We first show in Lemma 5.2 that there exists a set $\mathcal{N}\subseteq {\mathbb{S}}^{d1}$ with $\mathcal{N}={(1/\mathrm{\u25b3})}^{\mathrm{\Omega}(d)}$, so that for each $p\in \mathcal{N}$, there exists a hyperplane $L$ that separates $p$ and $\mathcal{N}\setminus \{p\}$, and all vectors in $\mathcal{N}$ have distance at least $\mathrm{\u25b3}$ to $L$. Equivalently, for each $p\in \mathcal{N},$we can always define a linear function ${f}_{p}$ so that ${f}_{p}(p)\ge \mathrm{\u25b3}$ and ${f}_{p}(q)\le \mathrm{\u25b3}$ for all $q\in \mathcal{N}\setminus \{p\}$. This can be proved using standard lower bounds on the size of $\epsilon $nets. Now we simply use vectors in $\mathcal{N}$ as features of states. By construction of the reward function, for each level $h$, there could only be two possible cases for the optimal policy ${\pi}^{*}$. I.e., either ${\pi}^{*}(s)={a}_{1}$ for all states in level $h$, or ${\pi}^{*}(s)={a}_{2}$ for a unique state $s$ and ${\pi}^{*}({s}^{\prime})={a}_{1}$ for all $s\ne {s}^{\prime}$. In both cases, we can easily define a linear function with margin $\mathrm{\u25b3}$ to implement the optimal policy ${\pi}^{*}$, and thus Assumption 4.4 holds.
5 Formal Proofs of Lower Bounds
In this section we present formal proofs of our lower bounds. We first introduce the INDEXQUERY problem, which will be useful in our lower bound arguments.
Definition 5.1 (INDEXQUERY).
In the ${\mathrm{INDQ}}_{n}$ problem, there is an underlying integer ${i}^{\mathrm{*}}\mathrm{\in}\mathrm{[}n\mathrm{]}$. The algorithm sequentially (and adaptively) outputs guesses $i\mathrm{\in}\mathrm{[}n\mathrm{]}$ and queries whether $i\mathrm{=}{i}^{\mathrm{*}}$. The goal is to output ${i}^{\mathrm{*}}$, using as few queries as possible.
Definition 5.2 ($\delta $correct algorithms).
For a real number $\delta \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$, we say a randomized algorithm $\mathrm{A}$ is $\delta $correct for ${\mathrm{INDQ}}_{n}$, if for any underlying integer ${i}^{\mathrm{*}}\mathrm{\in}\mathrm{[}n\mathrm{]}$, with probability at least $\mathrm{1}\mathrm{}\delta $, $\mathrm{A}$ outputs ${i}^{\mathrm{*}}$.
The following theorem states the query complexity of ${\mathrm{\U0001d5a8\U0001d5ad\U0001d5a3\U0001d5b0}}_{n}$ for $0.1$correct algorithms, whose proof is provided in Section A.1.
Theorem 5.1.
Any $\mathrm{0.1}$correct algorithm $\mathrm{A}$ for ${\mathrm{INDQ}}_{n}$ requires at least $\mathrm{0.9}\mathit{}n$ queries in the worst case.
5.1 Proof of Lower Bound for Valuebased Learning
In this section we prove Theorem 4.1. We need the following existential result, whose proof is provided in Section A.2.
Lemma 5.1.
For any $n\mathrm{>}\mathrm{2}$, there exists a set of vectors $\mathrm{P}\mathrm{=}\mathrm{\{}{p}_{\mathrm{0}}\mathrm{,}{p}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{p}_{n\mathrm{}\mathrm{1}}\mathrm{\}}\mathrm{\subset}{\mathrm{R}}^{d}$ with $d\mathrm{\ge}\mathrm{\lceil}\mathrm{8}\mathit{}\mathrm{ln}\mathit{}n\mathrm{/}{\epsilon}^{\mathrm{2}}\mathrm{\rceil}$ such that

1.
${\parallel {p}_{i}\parallel}_{2}=1$ for all $0\le i\le n1$;

2.
$\left\u27e8{p}_{i},{p}_{j}\u27e9\right\le \epsilon $ for any $0\le i,j\le n1$ with $i\ne j$.
Now we give the construction of the hard MDP instances. We first define the transitions and the reward functions. In the hard instances, both the rewards and the transitions are deterministic. There are $H$ levels of states, and level $h\in [H]$ contains ${2}^{h}$ distinct states. Thus we have $\mathcal{S}={2}^{H}1$. If $\mathcal{S}>{2}^{H}1$ we simply add dummy states to the state space $\mathcal{S}$. We use ${s}_{0},{s}_{1},\mathrm{\dots},{s}_{{2}^{H}2}$ to name these states. Here, ${s}_{0}$ is the unique state in level $h=0$, ${s}_{1}$ and ${s}_{2}$ are the two states in level $h=1$, ${s}_{3}$, ${s}_{4}$, ${s}_{5}$ and ${s}_{6}$ are the four states in level $h=2$, etc. There are two different actions, ${a}_{1}$ and ${a}_{2}$, in the MDPs. For a state ${s}_{i}$ in level $h$ with $$, playing action ${a}_{1}$ transits state ${s}_{i}$ to state ${s}_{2i+1}$ and playing action ${a}_{2}$ transits state ${s}_{i}$ to state ${s}_{2i+2}$, where ${s}_{2i+1}$ and ${s}_{2i+2}$ are both states in level $h+1$. See Figure 2 for an example with $H=3$.
In our hard instances, $r(s,a)=0$ for all $(s,a)$ pairs except for a unique state $s$ in level $H2$ and a unique action $a\in \{{a}_{1},{a}_{2}\}$. It is convenient to define $\overline{r}({s}^{\prime})=r(s,a)$, if playing action $a$ transits $s$ to ${s}^{\prime}$. For our hard instances, we have $\overline{r}(s)=1$ for a unique node $s$ in level $H1$ and $\overline{r}(s)=0$ for all other nodes.
Now we define the features map $\varphi (\cdot ,\cdot )$. Here we assume $d\ge 2\cdot \lceil 8\mathrm{ln}2\cdot H/{\delta}^{2}\rceil $, and otherwise we can simply decrease the planning horizon so that $d\ge 2\cdot \lceil 8\mathrm{ln}2\cdot H/{\delta}^{2}\rceil $. We invoke Lemma 5.1 to get a set $\mathcal{P}=\{{p}_{0},{p}_{1},\mathrm{\dots},{p}_{{2}^{H}1}\}\subset {\mathbb{R}}^{d/2}$. For each state ${s}_{i}$, $\varphi ({s}_{i},{a}_{1})\in {\mathbb{R}}^{d}$ is defined to be $[{p}_{i};0]$, and $\varphi ({s}_{i},{a}_{2})\in {\mathbb{R}}^{d}$ is defined to be $[0;{p}_{i}]$. This finishes the definition of the MDPs. We now show that no matter which state $s$ in level $H1$ satisfies $\overline{r}(s)=1$, the resulting MDP always satisfies Assumption 4.2.
Verifying Assumption 4.2.
By construction, for each level $h\in [H]$, there is a unique state ${s}_{h}$ in level $h$ and action ${a}_{h}\in \{{a}_{1},{a}_{2}\}$, such that ${Q}^{*}({s}_{h},{a}_{h})=1$. For all other $(s,a)$ pairs such that $s\ne {s}_{h}$ or $a\ne {a}_{h}$, it is satisfied that ${Q}^{*}(s,a)=0$. For a given level $h$ and policy $\pi $, we take ${\theta}_{h}^{\pi}$ to be ${Q}^{\pi}({s}_{h},{a}_{h})\cdot \varphi ({s}_{h},{a}_{h})$. Now we show that ${Q}^{\pi}(s,a)\u27e8{\theta}_{h}^{\pi},\varphi (s,a)\u27e9\le \delta $ for all states $s$ in level $h$ and $a\in \{{a}_{1},{a}_{2}\}$.
 Case I: $a\mathrm{\ne}{a}_{h}$.

In this case, we have ${Q}^{\pi}(s,a)=0$ and $\u27e8{\theta}_{h}^{\pi},\varphi (s,a)\u27e9=0$, since ${\theta}_{h}^{\pi}$ and $\varphi (s,a)$ do not have a common nonzero coordinate.
 Case II: $a\mathrm{=}{a}_{h}$ and $s\mathrm{\ne}{s}_{h}$.

In this case, by the second property of $\mathcal{P}$ in Lemma 5.1 and the fact that ${Q}^{\pi}({s}_{h},{a}_{h})\le 1$, we have $\u27e8{\theta}_{h}^{\pi},\varphi (s,a)\u27e9\le \delta $. Meanwhile, we have ${Q}^{\pi}(s,a)=0$.
 Case III: $a\mathrm{=}{a}_{h}$ and $s\mathrm{=}{s}_{h}$.

In this case, we have $\u27e8{\theta}_{h}^{\pi},\varphi (s,a)\u27e9={Q}^{\pi}({s}_{h},{a}_{h})$.
Finally, we prove any algorithm that solves these MDP instances and succeeds with probability at least $0.9$ needs to sample at least $\frac{9}{20}\cdot {2}^{H}$ trajectories. We do so by providing a reduction from ${\mathrm{\U0001d5a8\U0001d5ad\U0001d5a3\U0001d5b0}}_{{2}^{H1}}$ to solving MDPs. Suppose we have an algorithm for solving these MDPs, we show that such an algorithm can be transformed to solve ${\mathrm{\U0001d5a8\U0001d5ad\U0001d5a3\U0001d5b0}}_{{2}^{H1}}$. For a specific choice of ${i}^{*}$ in ${\mathrm{\U0001d5a8\U0001d5ad\U0001d5a3\U0001d5b0}}_{{2}^{H1}}$, there is a corresponding MDP instance with
$$\overline{r}(s)=\{\begin{array}{cc}1\hfill & \text{if}s={s}_{{i}^{*}+{2}^{H1}1}\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}.$$ 
Notice that for all MDPs that we are considering, the transition and features are always the same. Thus, the only thing that the learner needs to learn by interacting with the environment is the reward value. Since the reward value is nonzero only for states in level $H1$, each time the algorithm for solving MDP samples a trajectory that ends at state ${s}_{i}$ where ${s}_{i}$ is a state in level $H1$, we query whether ${i}^{*}=i{2}^{H1}+1$ or not in ${\mathrm{\U0001d5a8\U0001d5ad\U0001d5a3\U0001d5b0}}_{{2}^{H1}}$, and return reward value 1 if ${i}^{*}=i{2}^{H1}+1$ and 0 otherwise. If the algorithm is guaranteed to return a $1/2$optimal policy, then it must be able to find ${i}^{*}$.
5.2 Proof of Lower Bound for Policybased Learning
In this section, we present our hardness results for linear policy learning. In order to prove Theoerem 4.2, we need the following geometric lemma whose proof is provided in Section A.3.
Lemma 5.2.
Let $d\mathrm{\in}{\mathrm{N}}_{\mathrm{+}}$ be a positive integer and $\u03f5\mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$ be a real number. Then there exists a set of points $\mathrm{N}\mathrm{\subset}{\mathrm{SS}}^{d\mathrm{}\mathrm{1}}$ with size $\mathrm{}\mathrm{N}\mathrm{}\mathrm{=}\mathrm{\Omega}\mathit{}\mathrm{(}\mathrm{1}\mathrm{/}{\u03f5}^{d\mathrm{/}\mathrm{2}}\mathrm{)}$ such that for every point $x\mathrm{\in}\mathrm{N}$,
$\underset{y\in \mathrm{conv}(\mathcal{N}\backslash \{x\})}{inf}{\parallel xy\parallel}_{2}\ge \u03f5/2.$ 
Now we are ready to prove Theorem 4.2. In the proof we assume $H=d$, since otherwise we can take $H$ and $d$ to be $\mathrm{min}\{H,d\}$ by decreasing the planning horizon $H$ or adding dummy dimensions to the feature extractor $\varphi $.
We define a set of ${2}^{H1}$ deterministic MDPs. The transitions of these hard instances are exactly the same as those in Section 5.1. The main difference is in the definition of the feature map $\varphi (\cdot ,\cdot )$ and the reward function. Again in the hard instances, $r(s,a)=0$ for all $s$ in the first $H2$ levels. Using the terminology in Section 5.1, we have $\overline{r}(s)=0$ for all states in the first $H1$ levels. Now we define $\overline{r}(s)$ for states $s$ in level $H1$. We do so by recursively defining the optimal value function ${V}^{*}(\cdot )$. The initial state ${s}_{0}$ in level $0$ satisfies ${V}^{*}({s}_{0})=1/2$. For each state ${s}_{i}$ in the first $H2$ levels, we have ${V}^{*}({s}_{2i+1})={V}^{*}({s}_{i})$ and ${V}^{*}({s}_{2i+2})={V}^{*}({s}_{i})1/2H$. For each state ${s}_{i}$ in the level $h=H2$, we have $\overline{r}({s}_{2i+1})={V}^{*}({s}_{i})$ and $\overline{r}({s}_{2i+2})={V}^{*}({s}_{i})1/2H$. This implies that $\rho =1/2H$. In fact, this implies a stronger property that each state has a unique optimal action. See Figure 1 for an example with $H=3$.
To define ${2}^{H1}$ different MDPs, for each state $s$ in level $H1$ of the MDP defined above, we define a new MDP by changing $\overline{r}(s)$ from its original value to $1$. This also affects the definition of the optimal $V$ function for states in the first $H1$ levels. In particular, for each level $i\in \{0,1,2,\mathrm{\dots},H2\}$, we have changed the $V$ value of a unique state in level $i$ from its original value (at most $1/2$) to $1$. By doing so we have defined ${2}^{H1}$ different MDPs. See Figure 3 for an example with $H=3$.
Now we define the feature function $\varphi (\cdot ,\cdot )$. We invoke Lemma 5.2 with $\u03f5=8\mathrm{\u25b3}$ and $d=H/21$. Since $\mathrm{\u25b3}$ is sufficiently small, we have $\mathcal{N}\ge {2}^{H}$. We use $\mathcal{P}=\{{p}_{0},{p}_{2},\mathrm{\dots},{p}_{{2}^{H}1}\}\subset {\mathbb{R}}^{H/21}$ to denote an arbitrary subset of $\mathcal{N}$ with cardinality ${2}^{H}$. By Lemma 5.2, for any $p\in \mathcal{P}$, the distance between $p$ and the convex hull of $\mathcal{P}\setminus \{p\}$ is at least $4\mathrm{\u25b3}$. Thus, there exists a hyperplane $L$ which separates $p$ and $\mathcal{P}\setminus \{p\}$, and for all points $q\in \mathcal{P}$, the distance between $q$ and $L$ is at least $2\mathrm{\u25b3}$. Equivalently, for each point $p\in \mathcal{P}$, there exists ${n}_{p}\in {\mathbb{R}}^{H/21}$ and ${o}_{p}\in \mathbb{R}$ such that ${\parallel {n}_{p}\parallel}_{2}=1$, ${o}_{p}\le 1$ and the linear function ${f}_{p}(q)=\u27e8q,{n}_{p}\u27e9+{o}_{p}$ satisfies ${f}_{p}(p)\ge 2\mathrm{\u25b3}$ and ${f}_{p}(q)\le 2\mathrm{\u25b3}$ for all $q\in \mathcal{P}\setminus \{p\}$. Given the set $\mathcal{P}=\{{p}_{0},{p}_{2},\mathrm{\dots},{p}_{{2}^{H}1}\}\subset {\mathbb{R}}^{H/21}$, we construct a new set $\overline{\mathcal{P}}=\{{\overline{p}}_{0},{\overline{p}}_{2},\mathrm{\dots},{\overline{p}}_{{2}^{H}1}\}\subset {\mathbb{R}}^{H/2}$, where ${\overline{p}}_{i}=[{p}_{i};1]\in {\mathbb{R}}^{H/2}$. Thus ${\parallel {\overline{p}}_{i}\parallel}_{2}=\sqrt{2}$ for all ${\overline{p}}_{i}\in \overline{\mathcal{P}}$. Clearly, for each $\overline{p}\in \overline{\mathcal{P}}$, there exists a vector ${\omega}_{\overline{p}}\in {\mathbb{R}}^{H/2}$ such that $\u27e8{\omega}_{\overline{p}},\overline{p}\u27e9\ge 2\mathrm{\u25b3}$ and $\u27e8{\omega}_{\overline{p}},\overline{q}\u27e9\le 2\mathrm{\u25b3}$ for all $\overline{q}\in \overline{\mathcal{P}}\setminus \{\overline{p}\}$. It is also clear that ${\parallel {\omega}_{\overline{p}}\parallel}_{2}\le \sqrt{2}$. We take $\varphi ({s}_{i},{a}_{1})=[0;{\overline{p}}_{i}]\in {\mathbb{R}}^{H}$ and $\varphi ({s}_{i},{a}_{2})=[{\overline{p}}_{i};0]\in {\mathbb{R}}^{H}$.
We now show that all the ${2}^{H1}$ MDPs constructed above satisfy the linear policy assumption. Namely, we show that for any state $s$ in level $H1$, after changing $\overline{r}(s)$ to be 1, the resulting MDP satisfies the linear policy assumption. As in Section 5.1, for each level $h\in [H]$, there is a unique state ${s}_{h}$ in level $h$ and action ${a}_{h}\in \{{a}_{1},{a}_{2}\}$, such that ${Q}^{*}({s}_{h},{a}_{h})=1$. For all other $(s,a)$ pairs such that $s\ne {s}_{h}$ or $a\ne {a}_{h}$, it is satisfied that ${Q}^{*}(s,a)=0$. For each level $h$, if ${a}_{h}={a}_{1}$, then we take ${({\theta}_{h})}_{H/2}=1$ and ${({\theta}_{h})}_{H}=1$, and all other entries in ${\theta}_{h}$ are zeros. If ${a}_{h}={a}_{2}$, we use $\overline{p}$ to denote the vector formed by the first $H/2$ coordinates of $\varphi ({s}_{h},{a}_{2})$. By construction, we have $\overline{p}\in \overline{\mathcal{P}}$. We take ${\theta}_{h}=[{\omega}_{\overline{p}};0]$ in this case. In any case, we have ${\parallel {\theta}_{h}\parallel}_{2}\le \sqrt{2}$. Now for each level $h$, if ${a}_{h}={a}_{1}$, then for all states $s$ in level $h$, we have ${\pi}^{*}(s)={a}_{1}$. In this case, $\u27e8\varphi (s,{a}_{1}),{\theta}_{h}\u27e9=1$ and $\u27e8\varphi (s,{a}_{2}),{\theta}_{h}\u27e9=1$ for all states in level $h$, and thus Assumption 4.4 is satisfied. If ${a}_{h}={a}_{2}$, then ${\pi}^{*}({s}_{h})={a}_{2}$ and ${\pi}^{*}(s)={a}_{1}$ for all states $s\ne {s}_{h}$ in level $h$. By construction, we have $\u27e8{\theta}_{h},\varphi (s,{a}_{1})\u27e9=0$ for all states $s$ in level $h$, since ${\theta}_{h}$ and $\varphi (s,{a}_{1})$ do not have a common nonzero entry. We also have $\u27e8{\theta}_{h},\varphi ({s}_{h},{a}_{2})\u27e9\ge 2\mathrm{\u25b3}$ and $\u27e8{\theta}_{h},\varphi (s,{a}_{2})\u27e9\le 2\mathrm{\u25b3}$ for all states $s\ne {s}_{h}$ in level $h$. Finally, we normalize all ${\theta}_{h}$ and $\varphi (s,a)$ so that they all have unit norm. Since ${\parallel \varphi (s,a)\parallel}_{2}=\sqrt{2}$ for all $(s,a)$ pairs before normalization, Assumption 4.4 is still satisfied after normalization.
Finally, we prove any algorithm that solves these MDP instances and succeeds with probability at least $0.9$ needs to sample at least $\mathrm{\Omega}({2}^{H})$ trajectories. We do so by providing a reduction from ${\mathrm{\U0001d5a8\U0001d5ad\U0001d5a3\U0001d5b0}}_{{2}^{H1}}$ to solving MDPs. Suppose we have an algorithm for solving these MDPs, we show that such an algorithm can be transformed to solve ${\mathrm{\U0001d5a8\U0001d5ad\U0001d5a3\U0001d5b0}}_{{2}^{H1}}$. For a specific choice of ${i}^{*}$ in ${\mathrm{\U0001d5a8\U0001d5ad\U0001d5a3\U0001d5b0}}_{{2}^{H1}}$, there is a corresponding MDP instance with
$$\overline{r}(s)=\{\begin{array}{cc}1\hfill & \text{if}s={s}_{{i}^{*}+{2}^{H1}1}\hfill \\ \text{the original (recursively defined) value}\hfill & \text{otherwise}\hfill \end{array}.$$ 
Notice that for all MDPs that we are considering, the transition and features are always the same. Thus, the only thing that the learner needs to learn by interacting with the environment is the reward value. Since the reward value is nonzero only for states in level $H1$, each time the algorithm for solving MDP samples a trajectory that ends at state ${s}_{i}$ where ${s}_{i}$ is a state in level $H1$, we query whether ${i}^{*}=i{2}^{H1}+1$ or not in ${\mathrm{\U0001d5a8\U0001d5ad\U0001d5a3\U0001d5b0}}_{{2}^{H1}}$, and return reward value 1 if ${i}^{*}=i{2}^{H1}+1$ and it original reward value otherwise. If the algorithm is guaranteed to return a $1/4$optimal policy, then it must be able to find ${i}^{*}$.
6 Discussion
6.1 Separations
Perfect representation vs. goodbutnotperfect representation. For valuebased learning in deterministic systems, wen2013efficient showed polynomial sample complexity upper bound when the representation can perfectly predict the $Q$function. In contrast, if the representation is only able to approximate the $Q$function, then the agent requires exponential number of trajectories. This exponential separation demonstrates a provable exponential benefit of better representation.
Valuebased learning vs. policybased learning. Note that if the optimal $Q$function can be perfectly predicted by the provided representation, then the optimal policy can also be perfectly predicted using the same representation. Since wen2013efficient showed polynomial sample complexity upper bound when the representation can perfectly predict the $Q$function, our lower bound on policybased learning thus demonstrates that the ability of predicting the $Q$function is much stronger than that of predicting the optimal policy.
Supervised learning vs. reinforcement learning. For policybased learning, if the planning horizon $H=1$, the problem becomes learning a linear classifier, for which there are polynomial sample complexity upper bounds. For policybased learning, the agent needs to learn $H$ linear classifiers sequentially. Our lower bound on policybased learning shows the sample complexity dependency on $H$ is exponential.
Imitation learning vs. reinforcement learning. In imitation learning (IL), the agent can observe trajectories induced by the optimal policy (expert). If the optimal policy is linear in the given representation, it can be shown that the simple behavior cloning algorithm only requires polynomial number of samples to find a nearoptimal policy (ross2011reduction). Our Theorem 4.2 shows if the agent cannot observe expert’s behavior, then it requires exponential number of samples. Therefore, our lower bound shows there is an exponential separation between policybased RL and IL when function approximation is used.
6.2 Lower Bounds for Modelbased Learning
Finally, we remark that using the technique for proving the lower bound for valuebased learning, we can obtain a lower bound for “linear MDPs” in which the transition probability matrix can be approximated by a linear function of the representation. Section D shows that if the transition matrix is only approximated in the ${\mathrm{\ell}}_{\mathrm{\infty}}$ sense, then the agent still requires an exponential number of samples. We do note that an ${\mathrm{\ell}}_{\mathrm{\infty}}$ approximation for a transition matrix may be a weak condition. Under the stronger condition that the transition matrix can be approximated well under the total variational distance (${\mathrm{\ell}}_{1}$ distance), then there exists polynomial sample complexity upper bounds that can tolerate approximation error (yang2019sample; yang2019reinforcement; jin2019provably).
7 Acknowledgments
The authors would like to thank Yuping Luo, Wenlong Mou, Martin Wainwright, Mengdi Wang and Yifan Wu for insightful discussions. Simon S. Du is supported by National Science Foundation (Grant No. DMS1638352) and the Infosys Membership. Ruosong Wang is supported in part by NSF grant IIS1763562 and Office of Naval Research grant N000141812861. Sham M. Kakade acknowledges funding from the Washington Research Foundation Fund for Innovation in DataIntensive Discovery; the NSF award CCF 1740551; and the ONR award N000141812247.
References
Appendix A Technical Proofs
A.1 Proof of Theorem 5.1
Proof.
The proof is a straightforward application of Yao’s minimax principle yao1977probabilistic. We provide the full proof for completeness.
Consider an input distribution where ${i}^{*}$ is drawn uniformly at random from $[n]$. Suppose there is a $0.1$correct algorithm for ${\mathrm{\U0001d5a8\U0001d5ad\U0001d5a3\U0001d5b0}}_{n}$ with worst case query complexity $T$ such that $$. By averaging, there is a deterministic algorithm ${\mathcal{A}}^{\prime}$ with worst case query complexity $T$, such that
$$\underset{i\sim [n]}{\mathrm{Pr}}[{\mathcal{A}}^{\prime}\text{correctly outputs}i\text{when}{i}^{*}=i]\ge 0.9.$$ 
We may assume that the sequence of queries made by ${\mathcal{A}}^{\prime}$ is fixed. This is because (i) ${\mathcal{A}}^{\prime}$ is deterministic and (ii) before ${\mathcal{A}}^{\prime}$ correctly guesses ${i}^{*}$, all responses that ${\mathcal{A}}^{\prime}$ receives are the same (i.e., all guesses are incorrect). We use $S=\{{s}_{1},{s}_{2},\mathrm{\dots},{s}_{m}\}$ to denote the sequence of queries made by ${\mathcal{A}}^{\prime}$. Notice that $m$ is the worst case query complexity of ${\mathcal{A}}^{\prime}$. Suppose $$, there exist $0.1n$ distinct $i\in [n]$ such that ${\mathcal{A}}^{\prime}$ will never guess $i$, and will be incorrect if ${i}^{*}$ equals $i$, which implies
$$ 
∎
A.2 Proof of Lemma 5.1
We need the following tail inequality for random unit vectors, which will be useful for the proof of Lemma 5.1.
Lemma A.1 (Lemma 2.2 in dasgupta2003elementary).
For a random unit vector $u$ in ${\mathrm{R}}^{d}$ and $\beta \mathrm{>}\mathrm{1}$, we have
$$\mathrm{Pr}\left[{u}_{1}^{2}\ge \beta /d\right]\le \mathrm{exp}((1+\mathrm{ln}\beta \beta )/2).$$ 
In particular, when $\beta \mathrm{\ge}\mathrm{6}$,we have
$$\mathrm{Pr}\left[{u}_{1}^{2}>\beta /d\right]\le \mathrm{exp}(\beta /4).$$ 
Proof of Lemma 5.1.
Let $\mathcal{Q}=\{{q}_{1},{q}_{2},\mathrm{\dots},{q}_{n}\}$ be a set of $n$ independent random unit vectors in ${\mathbb{R}}^{d}$ with $d\ge \lceil 8\mathrm{ln}n/{\epsilon}^{2}\rceil $. We will prove that with probability at least $1/2$, $\mathcal{Q}$ satisfies the two desired properties as stated in Lemma 5.1. This implies the existence of such set $\mathcal{P}$.
It is clear that ${\parallel {q}_{i}\parallel}_{2}=1$ for all $i\in [n]$, since each ${q}_{i}$ is drawn from the unit sphere. We now prove that for any $i,j\in [n]$ with $i\ne j$, with probability at least $1\frac{1}{{n}^{2}}$, we have $\left\u27e8{q}_{i},{q}_{j}\u27e9\right\le \epsilon $. Notice that this is sufficient to prove the lemma, since by a union bound over all the $\left(\genfrac{}{}{0pt}{}{n}{2}\right)=n(n1)/2$ possible pairs of $(i,j)$, this implies that $\mathcal{Q}$ satisfies the two desired properties with probability at least $1/2$.
Now, we prove that for two independent random unit vectors $u$ and $v$ in ${\mathbb{R}}^{d}$ with $d\ge \lceil 8\mathrm{ln}n/{\epsilon}^{2}\rceil $, with probability at least $1\frac{1}{{n}^{2}}$, $\left\u27e8u,v\u27e9\right\le \epsilon $. By rotational invariance, we assume that $v$ is a standard basis vector. I.e., we assume ${v}_{1}=1$ and ${v}_{i}=0$ for all $$. Notice that now $\u27e8u,v\u27e9$ is the magnitude of the first coordinate of $u$. We finish the proof by invoking Lemma A.1 and taking $\beta =8\mathrm{ln}n>6$. ∎
A.3 Proof of Lemma 5.2
Proof of Lemma 5.2.
Consider a $\sqrt{\u03f5}$packing $\mathcal{N}$ with size $\mathrm{\Omega}(1/{\u03f5}^{d/2})$ on the $d$dimensional unit sphere ${\mathrm{SS}}^{d1}$ (for the existence of such a packing, see, e.g., lorentz1966metric). Let $o$ be the origin. For two points $x,{x}^{\prime}\in {\mathbb{R}}^{d}$, we denote $x{x}^{\prime}:={\parallel x{x}^{\prime}\parallel}_{2}$ the length of the line segment between $x,{x}^{\prime}$. Note that every two points $x,{x}^{\prime}\in \mathcal{N}$ satisfy $x{x}^{\prime}\ge \sqrt{\u03f5}$.
To prove the lemma, it suffices to show that $\mathcal{N}$ satisfies the desired property. Consider a point $x\in \mathcal{N}$, let $A$ be a hyperplane that is perpendicular to $x$ (notice that $x$ is a also a vector) and separates $x$ and every other points in $\mathcal{N}$. We let the distance between $x$ and $A$ be the largest possible, i.e., $A$ contains a point in $\mathcal{N}\backslash \{x\}$. Since $x$ is on the unit sphere and $\mathcal{N}$ is a $\sqrt{\u03f5}$packing, we have that $x$ is at least $\sqrt{\u03f5}$ away from every point on the spherical cap not containing $x$, defined by the cutting plane $A$. More formally, let $b$ be the intersection point of the line segment $ox$ and $A$. Then
$$\forall y\in \{{y}^{\prime}\in {\mathrm{SS}}^{ds}:\u27e8b,{y}^{\prime}\u27e9\le \parallel b{\parallel}_{2}^{2}\}:\mathit{\hspace{1em}}\parallel xy{\parallel}_{2}\ge \sqrt{\u03f5}.$$ 
Indeed, by symmetry, $\forall y\in \{{y}^{\prime}\in {\mathrm{SS}}^{d1}:\u27e8b,{y}^{\prime}\u27e9\le {\parallel b\parallel}_{2}^{2}\}$,
$${\parallel xy\parallel}_{2}\ge {\parallel xz\parallel}_{2}\ge \sqrt{\u03f5}.$$ 
where $z\in \mathcal{N}\cap A$. Notice that the distance between $x$ and the convex hull of $\mathcal{N}\backslash \{x\}$ is lower bounded by the distance between $x$ and $A$, which is given by $bx$. Consider the triangles defined by $x,z,o,b$. We have $bz\u27c2ox$ (note that $bz$ lies inside $A$). By Pythagorean theorem, we have
${bz}^{2}+{bx}^{2}$  $={xz}^{2};$  
$bx+bo$  $=xo=1;$  
${bz}^{2}+{bo}^{2}$  $={oz}^{2}=1.$ 
Solve the above three equations for $bx$, we have
$$bx={xz}^{2}/2\ge \u03f5/2$$ 
as desired. ∎
Appendix B Exact Linear ${Q}^{*}$ with Optimality Gap in Generative Model
In this section we prove the following theorem.
Theorem B.1.
Under Assumption 3.1 and Assumption 4.2 with $\delta \mathrm{=}\mathrm{0}$, in the Generative Model query model, there is an algorithm that finds ${\pi}^{\mathrm{*}}$ with $\mathrm{poly}\mathit{}\mathrm{(}d\mathrm{,}H\mathrm{,}\frac{\mathrm{1}}{\rho}\mathrm{)}$ trajectories with probability $\mathrm{0.99}$.
Proof of Theorem B.1.
We first describe the algorithm. For each level $h\in [H]$, the agent first constructs a barycentric spanner ${\mathrm{\Lambda}}_{h}\triangleq \{\varphi ({s}_{h}^{1},{a}_{h}^{1}),\mathrm{\dots}\varphi ({s}_{h}^{d},{a}_{h}^{d})\}\subset {\mathrm{\Phi}}_{h}\triangleq {\left\{\varphi (s,a)\right\}}_{s\in {\mathcal{S}}_{h},a\in \mathcal{A}}$. See awerbuch2008online for the definition of barycentric spanner and its construction. It holds that any $\varphi (s,a)$ with ${s}_{h}\in {\mathcal{S}}_{h},a\in \mathcal{A}$, we have ${c}_{s,a}^{1},\mathrm{\dots},{c}_{s,a}^{d}\in [1,1]$ such that $\varphi (s,a)={\sum}_{i=1}^{d}{c}_{s,a}^{i}\varphi ({s}_{h}^{i},{a}_{h}^{i})$.
The algorithm learns the optimal policy from $h=H1$ to $h=0$. At any level $h\in [H]$, we assume the agent has learned the optimal policy ${\pi}_{{h}^{\prime}}^{*}$ at level ${h}^{\prime}=h+1,\mathrm{\dots},H1$.
Now we present a procedure to learn the optimal policy at level $h$. At level $h$, the agent queries every vector $\varphi ({s}_{h}^{i},{a}_{h}^{i})$ in ${\mathrm{\Lambda}}_{h}$ for $\mathrm{poly}(d,\frac{1}{\rho})$ times and uses ${\pi}_{h+1}^{*},\mathrm{\dots},{\pi}_{H1}^{*}$ as the rollout to get the onthego reward. Note by the definition of ${\pi}^{*}$ and ${Q}^{*}$, the onthego reward is an unbiased sample of ${Q}^{*}({s}_{h}^{i},{a}_{h}^{i})$. We denote $\widehat{Q}({s}_{h}^{i},{a}_{h}^{i})$ the average of these onthego rewards. By Hoeffding inequality, it is easy to show with probability $1\frac{0.01}{H}$, for all $i=1,\mathrm{\dots},d$, $\left\widehat{Q}({s}_{h}^{i},{a}_{h}^{i}){Q}^{*}({s}_{h}^{i},{a}_{h}^{i})\right\le \mathrm{poly}(\frac{1}{d},\rho )$. Now we define our estimated ${Q}^{*}$ at level $h$ as follow: for any $(s,a)\in {\mathcal{S}}_{h}\times \mathcal{A}$, $\widehat{Q}(s,a)={\sum}_{i=1}^{d}{c}_{s,a}^{i}\widehat{Q}({s}_{h}^{i},{a}_{h}^{i})$. By the boundedness property of ${c}_{s,a}$, we know for any $(s,a)\in {\mathcal{S}}_{h}\times \mathcal{A}$, $$. Note this implies the policy induced by $\widehat{Q}$ is the same as ${\pi}^{*}$. We finish the proof by induction. ∎
Appendix C Linear ${Q}^{\pi}$ for all $\pi $ in Generative Model
In this section we present and prove the following theorem.
Theorem C.1.
Under Assumption 4.2 with $\delta \mathrm{=}\mathrm{0}$, in the Generative Model query model, there is an algorithm that finds an $\u03f5$optimal policy $\widehat{\pi}$ using $\mathrm{poly}\mathit{}\mathrm{(}d\mathrm{,}H\mathrm{,}\frac{\mathrm{1}}{\u03f5}\mathrm{)}$ trajectories with probability $\mathrm{0.99}$.
Proof of Theorem C.1.
The algorithm is the same as the one in Theorem B.1 We only need to change the analysis. Suppose we are learning at level $h$ and we have learned policies ${\pi}_{h+1},\mathrm{\dots},{\pi}_{H1}$ for level $h+1,h+2,\mathrm{\dots},H1$, respectively. Because we use the rollout policy ${\pi}_{h+1}\circ \mathrm{\cdots}\circ {\pi}_{H1}$, by Assumption 4.2 and the property of barycentric spanner, using the same argument in the proof of Theorem B.1, we know with probability $10.01/H$, we can learn a policy ${\pi}_{h}$ with $\mathrm{poly}(d,H,\frac{1}{\u03f5})$ samples such that for any $s\in {\mathcal{S}}_{h}$, we know ${\pi}_{h}$ is only suboptimal by $\frac{\u03f5}{H}$ from the ${\stackrel{~}{\pi}}_{h}$ where ${\stackrel{~}{\pi}}_{h}$ is the optimal policy at level $h$ such that ${\pi}_{h+1}\circ \mathrm{\cdots}\circ {\pi}_{H1}$ is the fixed rollout policy.
Now we can bound the suboptimality of $\widehat{\pi}\triangleq {\pi}_{0}\circ \mathrm{\cdots}\circ {\pi}_{H1}$:
${V}^{{\pi}_{0}\circ {\pi}_{1}\circ \mathrm{\cdots}\circ {\pi}_{H1}}\left({s}_{1}\right){V}^{{\pi}_{0}^{*}\circ {\pi}_{1}^{*}\circ \mathrm{\cdots}\circ {\pi}_{H1}^{*}}\left({s}_{1}\right)$  
$=$  ${V}^{{\pi}_{0}\circ {\pi}_{1}\circ \mathrm{\cdots}\circ {\pi}_{H1}}\left({s}_{1}\right){V}^{{\stackrel{~}{\pi}}_{0}\circ {\pi}_{1}\circ \mathrm{\cdots}\circ {\pi}_{H1}}\left({s}_{1}\right)$  
$+$  ${V}^{{\stackrel{~}{\pi}}_{0}\circ {\pi}_{1}\circ \mathrm{\cdots}\circ {\pi}_{H1}}\left({s}_{1}\right){V}^{{\pi}_{0}^{*}\circ {\pi}_{1}\circ \mathrm{\cdots}\circ {\pi}_{H1}}({s}_{1})$  
$+$  ${V}^{{\pi}_{0}^{*}\circ {\pi}_{1}\circ \mathrm{\cdots}\circ {\pi}_{H1}}({s}_{1}){V}^{{\pi}_{0}^{*}\circ {\pi}_{1}^{*}\circ \mathrm{\cdots}\circ {\pi}_{H1}^{*}}\left({s}_{1}\right).$ 
The first term is at least $\frac{\u03f5}{H}$ by our estimation bound, The second term is positive by definition of ${\stackrel{~}{\pi}}_{0}$. We can just recursively apply this argument to obtain
${V}^{{\pi}_{0}\circ {\pi}_{1}\circ \mathrm{\cdots}\circ {\pi}_{H1}}\left({s}_{1}\right){V}^{{\pi}_{0}^{*}\circ {\pi}_{1}^{*}\circ \mathrm{\cdots}\circ {\pi}_{H1}^{*}}\left({s}_{1}\right)\ge $  ${V}^{{\pi}_{0}^{*}\circ {\pi}_{1}\circ \mathrm{\cdots}\circ {\pi}_{H1}}({s}_{1}){V}^{{\pi}_{0}^{*}\circ {\pi}_{1}^{*}\circ \mathrm{\cdots}\circ {\pi}_{H1}^{*}}\left({s}_{1}\right){\displaystyle \frac{\u03f5}{H}}.$  
$\ge $  ${V}^{{\pi}_{0}^{*}\circ {\pi}_{1}^{*}\circ \mathrm{\cdots}\circ {\pi}_{H1}}({s}_{1}){V}^{{\pi}_{0}^{*}\circ {\pi}_{1}^{*}\circ \mathrm{\cdots}\circ {\pi}_{H1}^{*}}\left({s}_{1}\right){\displaystyle \frac{2\u03f5}{H}}.$  
$\ge $  $\mathrm{\dots}$  
$\ge $  $\u03f5.$ 
∎
Appendix D Lower Bound for Modelbased Learning
Here we present our lower bound for modelbased learning. Recently, yang2019sample proposed the linear transition assumption which was later studied in yang2019reinforcement, jin2019provably. Under this assumption, yang2019sample, yang2019reinforcement, jin2019provably developed algorithms with polynomial sample complexity. Again, we assume the agent is given a feature extractor $\varphi :\mathcal{S}\times \mathcal{A}\to {\mathbb{R}}^{d}$, and now we state the assumption formally as follow.
Assumption D.1 (Approximate Linear MDP).
There exists $\delta \mathrm{>}\mathrm{0}$, ${\beta}_{\mathrm{0}}\mathrm{,}{\beta}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\beta}_{H\mathrm{}\mathrm{1}}\mathrm{\in}{\mathrm{R}}^{d}$ and $\psi \mathrm{:}\mathrm{S}\mathrm{\to}{\mathrm{R}}^{d}$ such that for any $h\mathrm{\in}\mathrm{[}H\mathrm{}\mathrm{1}\mathrm{]}$, $\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\in}{\mathrm{S}}_{h}\mathrm{\times}\mathrm{A}$ and ${s}^{\mathrm{\prime}}\mathrm{\in}{\mathrm{S}}_{h\mathrm{+}\mathrm{1}}$, $\mathrm{}P\mathrm{(}{s}^{\mathrm{\prime}}\mathrm{\mid}s\mathrm{,}a\mathrm{)}\mathrm{}\mathrm{\u27e8}\psi \mathrm{(}{s}^{\mathrm{\prime}}\mathrm{)}\mathrm{,}\varphi \mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\u27e9}\mathrm{}\mathrm{\le}\delta $ and $\mathrm{\left}\mathrm{E}\mathit{}\mathrm{[}R\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{]}\mathrm{}\mathrm{\u27e8}{\beta}_{h}\mathrm{,}\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\u27e9}\mathrm{\right}\mathrm{\le}\delta $.
It has been shown in yang2019sample, yang2019reinforcement, jin2019provably if $\parallel P(\cdot \mid s,a)\u27e8\psi (\cdot ),\varphi (s,a)\u27e9{\parallel}_{1}$ is bounded, then the problem admits an algorithm with polynomial sample complexity. Now we show that when $\delta =\mathrm{\Omega}\left(\sqrt{\frac{H}{d}}\right)$ in Assumption D.1, the agent needs exponential number of samples to find a nearoptimal policy.
Theorem D.1 (Exponential Lower Bound for Linear Transition Model).
There exists a family of MDPs with $\mathrm{}\mathrm{A}\mathrm{}\mathrm{=}\mathrm{2}$ and a feature extractor $\varphi $ that satisfy Assumption D.1, such that any algorithm that returns a $\mathrm{1}\mathrm{/}\mathrm{2}$optimal policy with probability $\mathrm{0.9}$ needs to sample $\mathrm{\Omega}\mathit{}\mathrm{\left(}\mathrm{min}\mathit{}\mathrm{\{}\mathrm{}\mathrm{S}\mathrm{}\mathrm{,}{\mathrm{2}}^{H}\mathrm{,}\mathrm{exp}\mathit{}\mathrm{(}d\mathit{}{\delta}^{\mathrm{2}}\mathrm{/}\mathrm{16}\mathrm{)}\mathrm{\}}\mathrm{\right)}$ trajectories.
Proof of Theorem D.1.
We use the same construction in the proof of Theorem 4.1. Note we just need to verify that the construction satisfies Assumption D.1. By construction, for all $h\in \{1,2,\mathrm{\dots},H1\}$, for each state ${s}^{\prime}$ in level $h$, there exists a unique $(s,a)$ pair such that playing action $a$ transits $s$ to ${s}^{\prime}$, and we take $\psi ({s}^{\prime})=\varphi (s,a)$. We also take ${\beta}_{h}=0$ for $h\in \{0,1,\mathrm{\dots},H4,H3\}$ and ${\beta}_{H2}=\varphi (s,a)$ where $(s,a)$ is the unique pair with $R(s,a)=1$. Now, according to the design of $\varphi (\cdot ,\cdot )$ and Lemma 5.1, Assumption D.1 is satisfied. ∎