On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations

Abstract

Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence where an agent uses a neural network to learn which actions to take in a given environment. DRL has recently gained traction from being able to solve complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous implementations of the state-of-the-art algorithms responsible for training these agents, like the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist. However, studies make the mistake of assuming implementations of the same algorithm to be consistent and thus, interchangeable. In this paper, through a differential testing lens, we present the results of studying the extent of implementation inconsistencies, their effect on the implementations' performance, as well as their impact on the conclusions of prior studies under the assumption of interchangeable implementations. The outcomes of our differential tests showed significant discrepancies between the tested algorithm implementations, indicating that they are not interchangeable. In particular, out of the five PPO implementations tested on 56 games, three implementations achieved superhuman performance for 50% of their total trials while the other two implementations only achieved superhuman performance for less than 15% of their total trials. As part of a meticulous manual analysis of the implementations' source code, we analyzed implementation discrepancies and determined that code-level inconsistencies primarily caused these discrepancies. Lastly, we replicated a study and showed that this assumption of implementation interchangeability was sufficient to flip experiment outcomes. Therefore, this calls for a shift in how implementations are being used.
