Abstract
Inverse Reinforcement Learning (IRL) is a powerful framework for learning complex behaviors from expert demonstrations. However, it traditionally requires repeatedly solving a computationally expensive reinforcement learning (RL) problem in its inner loop. It is desirable to reduce the exploration burden by leveraging expert demonstrations in the inner-loop RL. As an example, recent work resets the learner to expert states in order to inform the learner of high-reward expert states. However, such an approach is infeasible in the real world. In this work, we consider an alternative approach to speeding up the RL subroutine in IRL: \emph{pessimism}, i.e., staying close to the expert's data distribution, instantiated via the use of offline RL algorithms. We formalize a connection between offline RL and IRL, enabling us to use an arbitrary offline RL algorithm to improve the sample efficiency of IRL. We validate our theory experimentally by demonstrating a strong correlation between the efficacy of an offline RL algorithm and how well it works as part of an IRL procedure. By using a strong offline RL algorithm as part of an IRL procedure, we are able to find policies that match expert performance significantly more efficiently than the prior art.