Abstract
Off-policy reinforcement learning (RL) using a fixed offline dataset of logged interactions is an important consideration in real-world applications. This paper studies offline RL using the DQN replay dataset comprising the entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this replay dataset, outperform the fully trained DQN agent. To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM trained on the DQN replay dataset surpasses strong RL baselines. The results here present an optimistic view that robust RL algorithms trained on sufficiently large and diverse offline datasets can lead to high-quality policies. The DQN replay dataset can serve as an offline RL benchmark and is open-sourced.
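To make the core idea of REM concrete, the following is a minimal sketch of the Bellman residual computed on a random convex combination of several Q-value estimates, for a single logged transition. It is an illustration under stated assumptions, not the paper's implementation: the function names (`sample_convex_weights`, `mix_q_values`, `rem_td_error`) are hypothetical, the weights are assumed to come from normalized uniform draws, and plain Python lists stand in for the outputs of Q-network heads. In practice the residual would be fed into a loss and minimized by gradient descent over minibatches from the offline dataset.

```python
import random


def sample_convex_weights(num_heads, rng):
    """Draw non-negative weights that sum to 1.

    Assumption: normalized uniform draws; the abstract only says "random
    convex combinations" and does not fix the distribution.
    """
    raw = [1.0 - rng.random() for _ in range(num_heads)]  # each value in (0, 1]
    total = sum(raw)
    return [r / total for r in raw]


def mix_q_values(q_heads, weights):
    """Combine K per-action Q-value lists into one list using the weights."""
    num_actions = len(q_heads[0])
    return [
        sum(w * head[a] for w, head in zip(weights, q_heads))
        for a in range(num_actions)
    ]


def rem_td_error(q_heads, next_q_heads, action, reward, done, weights, gamma=0.99):
    """Bellman optimality residual of the mixed Q-estimate for one transition.

    q_heads:      K lists of Q-values for the current state (one per estimate)
    next_q_heads: K lists of Q-values for the next state
    The same weights are applied to both sides of the Bellman equation.
    """
    q_mixed = mix_q_values(q_heads, weights)[action]
    next_mixed = mix_q_values(next_q_heads, weights)
    bootstrap = 0.0 if done else gamma * max(next_mixed)
    return q_mixed - (reward + bootstrap)


# Toy usage with K = 3 hypothetical Q-estimates over 2 actions.
rng = random.Random(0)
weights = sample_convex_weights(3, rng)
q_heads = [[0.2, 1.0], [0.4, 0.8], [0.1, 1.1]]
next_q_heads = [[0.5, 0.3], [0.6, 0.2], [0.4, 0.4]]
error = rem_td_error(q_heads, next_q_heads, action=1, reward=1.0,
                     done=False, weights=weights)
```

Drawing fresh weights for each update means that no single estimate can satisfy the objective on its own; the residual has to be small for many different mixtures, which is one way to read the "robust" in the description above.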