### Abstract

Modern tasks in reinforcement learning have large state and action spaces. Todeal with them efficiently, one often uses predefined feature mapping torepresent states and actions in a low dimensional space. In this paper, westudy reinforcement learning for discounted Markov Decision Processes (MDPs),where the transition kernel can be parameterized as a linear function ofcertain feature mapping. We propose a novel algorithm which makes use of thefeature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where$d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$is the discount factor of the MDP. To the best of our knowledge, this is thefirst polynomial regret bound without accessing to a generative model or makingstrong assumptions such as ergodicity of the MDP. By constructing a specialclass of MDPs, we also show that for any algorithms, the regret is lowerbounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower boundresults together suggest that the proposed reinforcement learning algorithm isnear-optimal up to a $(1-\gamma)^{-0.5}$ factor.