Abstract
A primary requirement for any reinforcement learning method is that it shouldproduce policies that improve upon the initial guess. In this work, we showthat the widely used Deep Q-Network (DQN) fails to satisfy this minimalcriterion -- even when it gets to see all possible states and actionsinfinitely often (a condition under which tabular Q-learning is guaranteed toconverge to the optimal Q-value function). Our specific contributions aretwofold. First, we numerically show that DQN often returns a policy thatperforms worse than the initial one. Second, we offer a theoretical explanationfor this phenomenon in linear DQN, a simplified version of DQN that uses linearfunction approximation in place of neural networks while retaining the otherkey components such as $\epsilon$-greedy exploration, experience replay, andtarget network. Using tools from differential inclusion theory, we prove thatthe limit points of linear DQN correspond to fixed points of projected Bellmanoperators. Crucially, we show that these fixed points need not relate tooptimal -- or even near-optimal -- policies, thus explaining linear DQN'ssub-optimal behaviors. We also give a scenario where linear DQN alwaysidentifies the worst policy. Our work fills a longstanding gap in understandingthe convergence behaviors of Q-learning with function approximation and$\epsilon$-greedy exploration.