GDI: Rethinking What Makes Reinforcement Learning Different From Supervised Learning

Abstract

Deep Q Network (DQN) first opened the door to deep reinforcement learning (DRL) by combining deep learning (DL) with reinforcement learning (RL), and observed that the distribution of the acquired data changes during the training process. DQN found that this property can make training unstable, so it proposed effective methods to handle the downside of the property. Instead of focusing on the unfavourable aspects, we find it critical for RL to ease the gap between the estimated data distribution and the ground truth data distribution, something supervised learning (SL) fails to do. From this new perspective, we extend the basic paradigm of RL, Generalized Policy Iteration (GPI), into a more general version called Generalized Data Distribution Iteration (GDI). We show that a large number of RL algorithms and techniques can be unified into the GDI paradigm, each of which can be considered a special case of GDI. We provide a theoretical proof of why GDI is better than GPI and how it works. Several practical algorithms based on GDI are proposed to verify its effectiveness and generality. Empirical experiments demonstrate state-of-the-art (SOTA) performance on the Arcade Learning Environment (ALE), where our algorithm achieves a 9620.98% mean human normalized score (HNS), a 1146.39% median HNS, and 22 human world record breakthroughs (HWRB) using only 200M training frames. Our work aims to lead RL research toward conquering the human world records and to seek real superhuman agents in both performance and efficiency.
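The abstract reports mean and median HNS without stating the formula. As a minimal sketch, assuming the conventional ALE definition (the agent's score rescaled so that random play is 0% and the human baseline is 100%), the aggregation could look like the following. The function name, game names, and all raw scores are hypothetical placeholders for illustration and are not taken from the paper.

```python
from statistics import mean, median


def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Return HNS in percent: 0% equals random play, 100% equals the human baseline."""
    return 100.0 * (agent - random) / (human - random)


# Hypothetical (agent, random, human) raw scores per game; NOT results from the paper.
scores = {
    "game_a": (500.0, 100.0, 300.0),
    "game_b": (50.0, 0.0, 100.0),
    "game_c": (2100.0, 100.0, 200.0),
}

# Per-game HNS, then the two aggregates reported in the abstract.
hns = {game: human_normalized_score(*s) for game, s in scores.items()}
mean_hns = mean(list(hns.values()))
median_hns = median(list(hns.values()))

print(f"mean HNS = {mean_hns:.2f}%, median HNS = {median_hns:.2f}%")
```

In this toy data a single game with a very large normalized score pulls the mean far above the median, which is one reason both aggregates are usually reported side by side.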
