DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction

Abstract

Deep reinforcement learning can learn effective policies for a wide range of tasks, but is notoriously difficult to use due to instability and sensitivity to hyperparameters. The reasons for this remain unclear. When using standard supervised methods (e.g., for bandits), on-policy data collection provides "hard negatives" that correct the model in precisely those states and actions that the policy is likely to visit. We call this phenomenon "corrective feedback." We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from this corrective feedback, and training on the experience collected by the algorithm is not sufficient to correct errors in the Q-function. In fact, Q-learning and related methods can exhibit pathological interactions between the distribution of experience collected by the agent and the policy induced by training on that experience, leading to potential instability, sub-optimal convergence, and poor results when learning from noisy, sparse or delayed rewards. We demonstrate the existence of this problem, both theoretically and empirically. We then show that a specific correction to the data distribution can mitigate this issue. Based on these observations, we propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training, resulting in substantial improvements in a range of challenging RL settings, such as multi-task learning and learning from noisy reward signals. A blog post summarizing this work is available at: https://bair.berkeley.edu/blog/2020/03/16/discor/.
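
The re-weighting idea can be illustrated with a minimal sketch. The abstract does not state the form of the weights, so everything here is an assumption made for illustration: the exponential weighting, the `temperature` parameter, the per-transition `error_estimates`, and the function names `transition_weights` and `weighted_td_loss` are all hypothetical and should not be read as the authors' implementation. The sketch only shows how down-weighting transitions with unreliable targets changes a squared Bellman-error loss.

```python
import math

def transition_weights(error_estimates, temperature=1.0):
    """Turn per-transition error estimates into normalized weights.

    Assumption (not from the abstract): a lower estimated error in the
    bootstrapped target gives a higher weight, via a softmax over the
    negated estimates.
    """
    # Shift by the minimum so the largest exponent is 0 (numerical stability).
    lowest = min(error_estimates)
    raw = [math.exp(-(e - lowest) / temperature) for e in error_estimates]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_td_loss(q_values, targets, weights):
    """Weighted sum of squared differences between Q-values and targets."""
    return sum(w * (q - y) ** 2 for q, y, w in zip(q_values, targets, weights))

# Hypothetical batch: the second transition has a less reliable target.
weights = transition_weights([0.1, 2.0, 0.1], temperature=1.0)
loss = weighted_td_loss([1.0, 3.0, 0.5], [0.8, 1.0, 0.6], weights)
```

With uniform weights every transition would contribute equally to the loss. In this sketch the transition with the largest estimated target error contributes least, which is the general effect a distribution correction of this kind is meant to have.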
