The Virtues of Pessimism in Inverse Reinforcement Learning

Abstract

Inverse Reinforcement Learning (IRL) is a powerful framework for learning complex behaviors from expert demonstrations. However, it traditionally requires repeatedly solving a computationally expensive reinforcement learning (RL) problem in its inner loop. It is desirable to reduce the exploration burden by leveraging expert demonstrations in the inner-loop RL. As an example, recent work resets the learner to expert states in order to inform the learner of high-reward expert states. However, such an approach is infeasible in the real world. In this work, we consider an alternative approach to speeding up the RL subroutine in IRL: \emph{pessimism}, i.e., staying close to the expert's data distribution, instantiated via the use of offline RL algorithms. We formalize a connection between offline RL and IRL, enabling us to use an arbitrary offline RL algorithm to improve the sample efficiency of IRL. We validate our theory experimentally by demonstrating a strong correlation between the efficacy of an offline RL algorithm and how well it works as part of an IRL procedure. By using a strong offline RL algorithm as part of an IRL procedure, we are able to find policies that match expert performance significantly more efficiently than the prior art.
