Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

Abstract

We study the reinforcement learning problem for discounted Markov DecisionProcesses (MDPs) under the tabular setting. We propose a model-based algorithmnamed UCBVI-$\gamma$, which is based on the \emph{optimism in the face ofuncertainty principle} and the Bernstein-type bonus. We show thatUCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$regret, where $S$ is the number of states, $A$ is the number of actions,$\gamma$ is the discount factor and $T$ is the number of steps. In addition, weconstruct a class of hard MDPs and show that for any algorithm, the expectedregret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$.Our upper bound matches the minimax lower bound up to logarithmic factors,which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discountedMDPs.

Quick Read (beta)

loading the full paper ...