Striving for Simplicity in Off-policy Deep Reinforcement Learning

Abstract

Reflecting on the advances of off-policy deep reinforcement learning (RL) algorithms since the development of DQN in 2013, it is important to ask: are the complexities of recent off-policy methods really necessary? In an attempt to isolate the contributions of various factors of variation in off-policy deep RL and to help design simpler algorithms, this paper investigates a set of related questions: First, can effective policies be learned given only access to logged offline experience? Second, how much of the benefit of recent distributional RL algorithms can be attributed to improvements in exploration versus exploitation behavior? Third, can simpler off-policy RL algorithms outperform distributional RL without learning explicit distributions over returns? This paper uses a batch RL experimental setup on Atari 2600 games to investigate these questions. Unexpectedly, we find that batch RL algorithms trained solely on logged experiences of a DQN agent are able to significantly outperform online DQN. Our experiments suggest that the benefits of distributional RL mainly stem from better exploitation. We present a simple and novel variant of ensemble Q-learning called Random Ensemble Mixture (REM), which enforces optimal Bellman consistency on random convex combinations of the Q-heads of a multi-head Q-network. The batch REM agent trained offline on DQN data outperforms the batch QR-DQN and online C51 algorithms.
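
To make the REM idea concrete, the following is a minimal NumPy sketch of the temporal-difference error computed on a random convex combination of Q-heads for a single transition. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the mixture weights are drawn, so normalized uniform samples are assumed here, and the names `random_convex_weights` and `rem_td_error` are hypothetical. A training loop would typically draw fresh weights for each mini-batch and minimize a loss on this error.

```python
import numpy as np


def random_convex_weights(num_heads, rng):
    """Draw non-negative weights that sum to 1 (a point on the simplex).

    Assumption: weights are uniform(0, 1) samples divided by their sum.
    """
    raw = rng.uniform(0.0, 1.0, size=num_heads)
    return raw / raw.sum()


def rem_td_error(q_heads, next_q_heads, action, reward, done, gamma, alpha):
    """TD error of the alpha-weighted mixture of Q-heads for one transition.

    q_heads, next_q_heads: arrays of shape (num_heads, num_actions) holding
    each head's Q-values at the current and next state.
    alpha: mixture weights of shape (num_heads,), shared by both sides of
    the Bellman equation.
    """
    q_mix = alpha @ q_heads            # mixed Q-values, shape (num_actions,)
    next_q_mix = alpha @ next_q_heads  # mixed next-state Q-values
    # Optimal Bellman target: bootstrap from the greedy mixed Q-value,
    # with no bootstrap term on terminal transitions (done == 1).
    target = reward + gamma * (1.0 - done) * next_q_mix.max()
    return q_mix[action] - target


rng = np.random.default_rng(0)
alpha = random_convex_weights(num_heads=4, rng=rng)

# Toy case: all four heads agree, so the mixture equals any single head
# regardless of which weights were drawn.
q_heads = np.tile(np.array([1.0, 2.0]), (4, 1))
next_q_heads = np.tile(np.array([0.5, 3.0]), (4, 1))
td = rem_td_error(q_heads, next_q_heads, action=1, reward=1.0,
                  done=0.0, gamma=0.9, alpha=alpha)
```

Because the weights are resampled, every head is pushed toward Bellman consistency under many different mixtures while sharing one network, which is what distinguishes this sketch from averaging a fixed ensemble.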
