Abstract
Widely used deep reinforcement learning algorithms have been shown to fail in the batch setting, that is, when learning from a fixed data set without interaction with the environment. Following this result, several papers have reported reasonable performance across a variety of environments and batch settings. In this paper, we benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show it outperforms all existing algorithms at this task.