Abstract
We investigate the sample efficiency of reinforcement learning in a$\gamma$-discounted infinite-horizon Markov decision process (MDP) with statespace $\mathcal{S}$ and action space $\mathcal{A}$, assuming access to agenerative model. Despite a number of prior work tackling this problem, acomplete picture of the trade-offs between sample complexity and statisticalaccuracy is yet to be determined. In particular, prior results suffer from asample size barrier, in the sense that their claimed statistical guaranteeshold only when the sample size exceeds at least$\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ (up to some log factor). Thecurrent paper overcomes this barrier by certifying the minimax optimality ofmodel-based reinforcement learning as soon as the sample size exceeds the orderof $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Morespecifically, a perturbed model-based planning algorithm provably finds an$\varepsilon$-optimal policy with an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}\log\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\varepsilon}$samples for any $\varepsilon \in (0, \frac{1}{1-\gamma}]$. Along the way, wederive improved (instance-dependent) guarantees for model-based policyevaluation. To the best of our knowledge, this work provides the firstminimax-optimal guarantee in a generative model that accommodates the entirerange of sample sizes (beyond which finding a meaningful policy is informationtheoretically impossible).