Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target

Abstract

Multi-step methods such as Retrace($\lambda$) and $n$-step $Q$-learning have become a crucial component of modern deep reinforcement learning agents. These methods are often evaluated as a part of bigger architectures and their evaluations rarely include enough samples to draw statistically significant conclusions about their performance. This type of methodology makes it difficult to understand how particular algorithmic details of multi-step methods influence learning. In this paper we combine the $n$-step action-value algorithms Retrace, $Q$-learning, Tree Backup, Sarsa, and $Q(\sigma)$ with an architecture analogous to DQN. We test the performance of all these algorithms in the mountain car environment; this choice of environment allows for faster training times and larger sample sizes. We present statistical analyses on the effects of the off-policy correction, the backup length parameter $n$, and the update frequency of the target network on the performance of these algorithms. Our results show that (1) using off-policy correction can have an adverse effect on the performance of Sarsa and $Q(\sigma)$; (2) increasing the backup length $n$ consistently improved performance across all the different algorithms; and (3) the performance of Sarsa and $Q$-learning was more robust to the effect of the target network update frequency than the performance of Tree Backup, $Q(\sigma)$, and Retrace in this particular task.
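
To make the role of the backup length $n$ and the target network concrete, the following is a minimal sketch of one common form of the $n$-step $Q$-learning target without off-policy correction: $G = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n \max_a Q_{\text{target}}(s_{t+n}, a)$. It is an illustration under stated assumptions, not the paper's implementation: the function name `n_step_q_target`, its arguments, and the numbers in the usage lines are all hypothetical.

```python
from typing import Sequence


def n_step_q_target(
    rewards: Sequence[float],
    bootstrap_q_values: Sequence[float],
    gamma: float,
    terminal: bool,
) -> float:
    """Uncorrected n-step Q-learning target (illustrative sketch).

    rewards            -- the n rewards observed after the update state
                          (fewer than n if the episode ended early)
    bootstrap_q_values -- target-network action values at the state reached
                          after those rewards (assumed to be given)
    gamma              -- discount factor
    terminal           -- True if the episode ended within the backup,
                          in which case no bootstrapping is applied
    """
    # Discounted sum of the observed rewards.
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward

    # Bootstrap from the greedy target-network value unless the episode ended.
    if not terminal:
        target += (gamma ** len(rewards)) * max(bootstrap_q_values)
    return target


# Hypothetical usage: a 3-step backup with a reward of -1 per step.
g_bootstrap = n_step_q_target([-1.0, -1.0, -1.0], [2.0, 4.0], gamma=0.5, terminal=False)
g_terminal = n_step_q_target([-1.0, -1.0, -1.0], [2.0, 4.0], gamma=0.5, terminal=True)
```

In this sketch a larger $n$ shifts weight from the bootstrapped estimate to observed rewards, and the bootstrapped values only change when the target network is refreshed, which is one way to see why both parameters can matter for performance.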
