Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs

Abstract

Q-learning is a popular Reinforcement Learning (RL) algorithm which is widelyused in practice with function approximation (Mnih et al., 2015). In contrast,existing theoretical results are pessimistic about Q-learning. For example,(Baird, 1995) shows that Q-learning does not converge even with linear functionapproximation for linear MDPs. Furthermore, even for tabular MDPs withsynchronous updates, Q-learning was shown to have sub-optimal sample complexity(Li et al., 2021;Azar et al., 2013). The goal of this work is to bridge the gapbetween practical success of Q-learning and the relatively pessimistictheoretical results. The starting point of our work is the observation that inpractice, Q-learning is used with two important modifications: (i) trainingwith two networks, called online network and target network simultaneously(online target learning, or OTL) , and (ii) experience replay (ER) (Mnih etal., 2015). While they have been observed to play a significant role in thepractical success of Q-learning, a thorough theoretical understanding of howthese two modifications improve the convergence behavior of Q-learning has beenmissing in literature. By carefully combining Q-learning with OTL and reverseexperience replay (RER) (a form of experience replay), we present novel methodsQ-Rex and Q-RexDaRe (Q-Rex + data reuse). We show that Q-Rex efficiently findsthe optimal policy for linear MDPs (or more generally for MDPs with zeroinherent Bellman error with linear approximation (ZIBEL)) and providenon-asymptotic bounds on sample complexity -- the first such result for aQ-learning method for this class of MDPs under standard assumptions.Furthermore, we demonstrate that Q-RexDaRe in fact achieves near optimal samplecomplexity in the tabular setting, improving upon the existing results forvanilla Q-learning.

Quick Read (beta)

loading the full paper ...