Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Abstract

Modern tasks in reinforcement learning are always with large state and actionspaces. To deal with them efficiently, one often uses predefined featuremapping to represents states and actions in a low dimensional space. In thispaper, we study reinforcement learning with feature mapping for discountedMarkov Decision Processes (MDPs). We propose a novel algorithm which makes useof the feature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret,where $d$ is the dimension of the feature space, $T$ is the time horizon and$\gamma$ is the discount factor of the MDP. To the best of our knowledge, thisis the first polynomial regret bound without accessing to a generative model ormaking strong assumptions such as ergodicity of the MDP. By constructing aspecial class of MDPs, we also show that for any algorithms, the regret islower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lowerbound results together suggest that the proposed reinforcement learningalgorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.

Quick Read (beta)

loading the full paper ...