Abstract
Episodic reinforcement learning and contextual bandits are two widely studiedsequential decision-making problems. Episodic reinforcement learninggeneralizes contextual bandits and is often perceived to be more difficult dueto long planning horizon and unknown state-dependent transitions. The currentpaper shows that the long planning horizon and the unknown state-dependenttransitions (at most) pose little additional difficulty on sample complexity.We consider the episodic reinforcement learning with $S$ states, $A$ actions,planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$episodes. We propose a new algorithm, \textbf{M}onotonic \textbf{V}alue\textbf{P}ropagation (MVP), which relies on a new Bernstein-type bonus. The newbonus only requires tweaking the \emph{constants} to ensure optimism and thusis significantly simpler than existing bonus constructions. We show MVP enjoysan $O\left(\left(\sqrt{SAK} + S^2A\right) \text{poly}\log\left(SAHK\right)\right)$ regret, approaching the$\Omega\left(\sqrt{SAK}\right)$ lower bound of \emph{contextual bandits}.Notably, this result 1) \emph{exponentially} improves the state-of-the-artpolynomial-time algorithms by Dann et al. [2019], Zanette et al. [2019] andZhang et al. [2020] in terms of the dependency on $H$, and 2)\emph{exponentially} improves the running time in [Wang et al. 2020] andsignificantly improves the dependency on $S$, $A$ and $K$ in sample complexity.