Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Abstract

Modern tasks in reinforcement learning have large state and action spaces. Todeal with them efficiently, one often uses predefined feature mapping torepresent states and actions in a low dimensional space. In this paper, westudy reinforcement learning for discounted Markov Decision Processes (MDPs),where the transition kernel can be parameterized as a linear function ofcertain feature mapping. We propose a novel algorithm which makes use of thefeature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where$d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$is the discount factor of the MDP. To the best of our knowledge, this is thefirst polynomial regret bound without accessing to a generative model or makingstrong assumptions such as ergodicity of the MDP. By constructing a specialclass of MDPs, we also show that for any algorithms, the regret is lowerbounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower boundresults together suggest that the proposed reinforcement learning algorithm isnear-optimal up to a $(1-\gamma)^{-0.5}$ factor.

Quick Read (beta)

loading the full paper ...