Is Deep Reinforcement Learning Really Superhuman on Atari? Leveling the playing field

Abstract

Consistent and reproducible evaluation of Deep Reinforcement Learning (DRL) is not straightforward. In the Arcade Learning Environment (ALE), small changes in environment parameters such as stochasticity or the maximum allowed play time can lead to very different performance. In this work, we discuss the difficulties of comparing different agents trained on ALE. In order to take a step further towards reproducible and comparable DRL, we introduce SABER, a Standardized Atari BEnchmark for general Reinforcement learning algorithms. Our methodology extends previous recommendations and contains a complete set of environment parameters as well as train and test procedures. We then use SABER to evaluate the current state of the art, Rainbow. Furthermore, we introduce a human world records baseline, and argue that previous claims of expert or superhuman performance of DRL might not be accurate. Finally, we propose Rainbow-IQN by extending Rainbow with Implicit Quantile Networks (IQN), leading to new state-of-the-art performance. Source code is available for reproducibility.
