PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning

Abstract

We study reinforcement learning (RL) with no-reward demonstrations, a setting in which an RL agent has access to additional data from the interaction of other agents with the same environment. However, it has no access to the rewards or goals of these agents, and their objectives and levels of expertise may vary widely. These assumptions are common in multi-agent settings, such as autonomous driving. To effectively use this data, we turn to the framework of successor features. This allows us to disentangle shared features and dynamics of the environment from agent-specific rewards and policies. We propose a multi-task inverse reinforcement learning (IRL) algorithm, called \emph{inverse temporal difference learning} (ITD), that learns shared state features, alongside per-agent successor features and preference vectors, purely from demonstrations without reward labels. We further show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $\Psi\Phi$-learning (pronounced `Sci-Fi'). We provide empirical evidence for the effectiveness of $\Psi\Phi$-learning as a method for improving RL, IRL, imitation, and few-shot transfer, and derive worst-case bounds for its performance in zero-shot transfer to new tasks.
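
For readers unfamiliar with successor features, the disentanglement mentioned above can be sketched with the standard decomposition. The symbols below are assumptions, since the abstract does not fix notation: $\phi$ denotes the shared features, $\mathbf{w}^k$ the preference vector of agent $k$, $\psi^k$ that agent's successor features under its policy $\pi^k$, and $\gamma$ the discount factor. If each agent's reward is linear in the shared features, its action-value function factorizes into a policy-dependent part and a reward-dependent part:

\[
r^k(s, a) = \phi(s, a)^\top \mathbf{w}^k,
\qquad
\psi^k(s, a) = \mathbb{E}_{\pi^k}\!\left[ \sum_{t=0}^{\infty} \gamma^t \phi(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a \right],
\qquad
Q^k(s, a) = \psi^k(s, a)^\top \mathbf{w}^k .
\]

Under this assumed notation, $\phi$ is shared across agents, while $\psi^k$ and $\mathbf{w}^k$ are the per-agent quantities.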
