Abstract
Despite the great empirical success of deep reinforcement learning, itstheoretical foundation is less well understood. In this work, we make the firstattempt to theoretically understand the deep Q-network (DQN) algorithm (Mnih etal., 2015) from both algorithmic and statistical perspectives. In specific, wefocus on a slight simplification of DQN that fully captures its key features.Under mild assumptions, we establish the algorithmic and statistical rates ofconvergence for the action-value functions of the iterative policy sequenceobtained by DQN. In particular, the statistical error characterizes the biasand variance that arise from approximating the action-value function using deepneural network, while the algorithmic error converges to zero at a geometricrate. As a byproduct, our analysis provides justifications for the techniquesof experience replay and target network, which are crucial to the empiricalsuccess of DQN. Furthermore, as a simple extension of DQN, we propose theMinimax-DQN algorithm for zero-sum Markov game with two players. Borrowing theanalysis of DQN, we also quantify the difference between the policies obtainedby Minimax-DQN and the Nash equilibrium of the Markov game in terms of both thealgorithmic and statistical rates of convergence.