Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations

Abstract

The performance of imitation learning is typically upper-bounded by the performance of the demonstrator. While recent empirical results demonstrate that ranked demonstrations allow for better-than-demonstrator performance, preferences over demonstrations may be difficult to obtain, and little is known theoretically about when such methods can be expected to successfully extrapolate beyond the performance of the demonstrator. To address these issues, we first contribute a sufficient condition for better-than-demonstrator imitation learning and provide theoretical results showing why preferences over demonstrations can better reduce reward function ambiguity when performing inverse reinforcement learning. Building on this theory, we introduce Disturbance-based Reward Extrapolation (D-REX), a ranking-based imitation learning method that injects noise into a policy learned through behavioral cloning to automatically generate ranked demonstrations. These ranked demonstrations are used to efficiently learn a reward function that can then be optimized using reinforcement learning. We empirically validate our approach on simulated robot and Atari imitation learning benchmarks and show that D-REX outperforms standard imitation learning approaches and can significantly surpass the performance of the demonstrator. D-REX is the first imitation learning approach to achieve significant extrapolation beyond the demonstrator's performance without additional side-information or supervision, such as rewards or human preferences. By generating rankings automatically, we show that preference-based inverse reinforcement learning can be applied in traditional imitation learning settings where only unlabeled demonstrations are available.
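The automatic ranking step can be illustrated with a minimal sketch. It assumes epsilon-greedy noise and a toy environment interface, neither of which is specified in the abstract. The names `generate_ranked_rollouts` and `preference_pairs` and the `reset`/`step` callbacks are hypothetical. The working assumption is that rollouts produced with less injected noise are, on average, better, so the noise level can stand in for a ranking label.

```python
import random
from itertools import combinations


def generate_ranked_rollouts(policy, reset, step, actions, noise_levels,
                             rollouts_per_level, horizon, seed=0):
    """Roll out a cloned policy under increasing noise (hypothetical helper).

    Assumption: epsilon-greedy noise, i.e. with probability epsilon the
    policy's action is replaced by a uniformly random one.
    Returns a list of (epsilon, trajectory) tuples, lowest noise first.
    """
    rng = random.Random(seed)
    rollouts = []
    for epsilon in sorted(noise_levels):
        for _ in range(rollouts_per_level):
            state = reset()
            trajectory = []
            for _ in range(horizon):
                if rng.random() < epsilon:
                    action = rng.choice(actions)   # injected disturbance
                else:
                    action = policy(state)         # behavioral-cloning action
                trajectory.append((state, action))
                state = step(state, action)
            rollouts.append((epsilon, trajectory))
    return rollouts


def preference_pairs(rollouts):
    """Build (preferred, less_preferred) trajectory pairs.

    Assumption: a rollout with strictly lower noise is preferred.
    Rollouts with equal noise are not compared.
    """
    pairs = []
    for (eps_a, traj_a), (eps_b, traj_b) in combinations(rollouts, 2):
        if eps_a < eps_b:
            pairs.append((traj_a, traj_b))
        elif eps_b < eps_a:
            pairs.append((traj_b, traj_a))
    return pairs
```

Pairs of this kind could then serve as training data for a reward model, for example with a pairwise ranking loss (the choice of loss is an assumption here), and the learned reward optimized with reinforcement learning as the abstract describes.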
