Abstract
We introduce and analyze a form of variance-reduced $Q$-learning. For$\gamma$-discounted MDPs with finite state space $\mathcal{X}$ and action space$\mathcal{U}$, we prove that it yields an $\epsilon$-accurate estimate of theoptimal $Q$-function in the $\ell_\infty$-norm using $\mathcal{O}\left(\left(\frac{D}{ \epsilon^2 (1-\gamma)^3} \right) \; \log \left(\frac{D}{(1-\gamma)} \right) \right)$ samples, where $D = |\mathcal{X}| \times|\mathcal{U}|$. This guarantee matches known minimax lower bounds up to alogarithmic factor in the discount complexity, and is the first form ofmodel-free $Q$-learning proven to achieve the worst-case optimal cubic scalingin the discount complexity parameter $1/(1-\gamma)$ accompanied by optimallinear scaling in the state and action space sizes. By contrast, our past workshows that ordinary $Q$-learning has worst-case quartic scaling in the discountcomplexity.