Reinforcement Learning from Imperfect Demonstrations

Abstract

Robust real-world learning should benefit from both demonstrations and interactions with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on the reward received from the environment. These tasks have divergent losses, which are difficult to jointly optimize, and such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstrations and refines the policy in the environment, surpassing the demonstrator's performance. Crucially, both learning from demonstration and interactive refinement use the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data, since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games.
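
The normalization effect described above can be illustrated with a toy tabular sketch. It is not the paper's update rule. It assumes an entropy-regularized (soft) formulation in which the state value is a log-sum-exp over Q-values and the policy is proportional to exp(Q - V); the abstract does not state this form, and the function names, learning rate, and temperature `alpha` are hypothetical. Under that assumption, raising the log-probability of a demonstrated action lowers the Q-values of the other actions.

```python
import math

def soft_value(q_values, alpha=1.0):
    # Assumed soft value: V = alpha * log(sum_a exp(Q(a) / alpha)),
    # computed in a numerically stable way by factoring out the max.
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def policy(q_values, alpha=1.0):
    # Assumed policy: pi(a) = exp((Q(a) - V) / alpha), which sums to 1.
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

def normalized_step(q_values, demo_action, lr=0.5, alpha=1.0):
    # One gradient-ascent step on log pi(demo_action) = (Q(a) - V) / alpha.
    # The gradient w.r.t. Q(b) is (1[b == a] - pi(b)) / alpha, so the
    # "- V" term pushes down every action that was not demonstrated.
    pi = policy(q_values, alpha)
    return [
        q + lr * ((1.0 if b == demo_action else 0.0) - pi[b]) / alpha
        for b, q in enumerate(q_values)
    ]

if __name__ == "__main__":
    q = [0.0, 0.0, 0.0]          # three actions, all equally valued
    q_new = normalized_step(q, demo_action=0)
    print(q_new)                  # Q of action 0 rises, the others fall
    print(policy(q_new))
```

In this toy setting, a plain supervised or Q-regression update on the demonstrated action alone would leave the unseen actions' Q-values untouched; the subtraction of the soft value is what couples them.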
