Nearly Horizon-Free Offline Reinforcement Learning

Abstract

We revisit offline reinforcement learning on episodic time-homogeneoustabular Markov Decision Processes with $S$ states, $A$ actions and planninghorizon $H$. Given the collected $N$ episodes data with minimum cumulativereaching probability $d_m$, we obtain the first set of nearly $H$-free samplecomplexity bounds for evaluation and planning using the empirical MDPs: 1.Forthe offline evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Nd_m}}\right)$ error rate, which matches the lower bound and does not have additionaldependency on $\poly\left(S,A\right)$ in higher-order term, that is differentfrom previous works~\citep{yin2020near,yin2020asymptotically}. 2.For theoffline policy optimization, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Nd_m}}+ \frac{S}{Nd_m}\right)$ error rate, improving upon the best known result by\cite{cui2020plug}, which has additional $H$ and $S$ factors in the main term.Furthermore, this bound approaches the$\Omega\left(\sqrt{\frac{1}{Nd_m}}\right)$ lower bound up to logarithmicfactors and a high-order term. To the best of our knowledge, these are thefirst set of nearly horizon-free bounds in offline reinforcement learning.

Quick Read (beta)

loading the full paper ...