### Abstract

Model-free reinforcement learning (RL) algorithms, such as Q-learning,directly parameterize and update value functions or policies without explicitlymodeling the environment. They are typically simpler, more flexible to use, andthus more prevalent in modern deep RL than model-based approaches. However,empirical work has suggested that model-free algorithms may require moresamples to learn [Deisenroth and Rasmussen 2011, Schulman et al. 2015]. Thetheoretical question of "whether model-free algorithms can be made sampleefficient" is one of the most fundamental questions in RL, and remains unsolvedeven in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB explorationachieves regret $\tilde{O}(\sqrt{H^3 SAT})$, where $S$ and $A$ are the numbersof states and actions, $H$ is the number of steps per episode, and $T$ is thetotal number of steps. This sample efficiency matches the optimal regret thatcan be achieved by any model-based approach, up to a single $\sqrt{H}$ factor.To the best of our knowledge, this is the first analysis in the model-freesetting that establishes $\sqrt{T}$ regret without requiring access to a"simulator."

### Introduction (beta)

None

### Conclusion (beta)

None