Abstract
Modern tasks in reinforcement learning are always with large state and actionspaces. To deal with them efficiently, one often uses predefined featuremapping to represents states and actions in a low dimensional space. In thispaper, we study reinforcement learning with feature mapping for discountedMarkov Decision Processes (MDPs). We propose a novel algorithm which makes useof the feature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret,where $d$ is the dimension of the feature space, $T$ is the time horizon and$\gamma$ is the discount factor of the MDP. To the best of our knowledge, thisis the first polynomial regret bound without accessing to a generative model ormaking strong assumptions such as ergodicity of the MDP. By constructing aspecial class of MDPs, we also show that for any algorithms, the regret islower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lowerbound results together suggest that the proposed reinforcement learningalgorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.