Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations

Abstract

A critical flaw of existing inverse reinforcement learning (IRL) methods is their inability to significantly outperform the demonstrator. This is because IRL typically seeks a reward function that makes the demonstrator appear near-optimal, rather than inferring the underlying intentions of the demonstrator that may have been poorly executed in practice. In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from a set of potentially poor demonstrations. When combined with deep reinforcement learning, T-REX outperforms state-of-the-art imitation learning and IRL methods on multiple Atari and MuJoCo benchmark tasks and achieves performance that is often more than twice the performance of the best demonstration. We also demonstrate that T-REX is robust to ranking noise and can accurately extrapolate intention by simply watching a learner noisily improve at a task over time.
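
To make the idea of learning a reward from ranked demonstrations concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not state the training objective, so the sketch assumes a pairwise logistic (Bradley-Terry style) ranking loss over predicted trajectory returns and a linear reward model in place of a deep network. All names (`fit_reward`, `traj_return`, `demos`, `unseen`) and the toy one-dimensional data are hypothetical.

```python
import math

def traj_return(w, traj):
    """Predicted return: sum of linear rewards r(s) = w . s over all states."""
    return sum(sum(wk * sk for wk, sk in zip(w, s)) for s in traj)

def feature_sum(traj):
    """Per-dimension sum of state features along a trajectory."""
    return [sum(s[k] for s in traj) for k in range(len(traj[0]))]

def fit_reward(ranked_trajs, dim, steps=200, lr=0.05):
    """Fit w by gradient descent on an assumed pairwise ranking loss.

    ranked_trajs is ordered worst to best. For each pair (i, j) with j ranked
    above i, the loss is log(1 + exp(-(R_j - R_i))).
    """
    w = [0.0] * dim
    sums = [feature_sum(t) for t in ranked_trajs]
    n = len(ranked_trajs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for _ in range(steps):
        grad = [0.0] * dim
        for i, j in pairs:
            # R_j - R_i under the current linear reward
            diff = sum(wk * (a - b) for wk, a, b in zip(w, sums[j], sums[i]))
            # Numerically stable sigmoid(-diff): probability the pair is misordered
            if diff >= 0:
                e = math.exp(-diff)
                p_wrong = e / (1.0 + e)
            else:
                p_wrong = 1.0 / (1.0 + math.exp(diff))
            for k in range(dim):
                grad[k] -= p_wrong * (sums[j][k] - sums[i][k])
        w = [wk - lr * g / len(pairs) for wk, g in zip(w, grad)]
    return w

# Toy data (hypothetical): three poor demonstrations, ranked worst to best.
demos = [[[0.0]] * 5, [[0.2]] * 5, [[0.4]] * 5]
w = fit_reward(demos, dim=1)

# A trajectory better than any demonstration still scores higher under the
# learned reward, which is the sense of "extrapolation" in this sketch.
unseen = [[1.0]] * 5
print(traj_return(w, unseen) > traj_return(w, demos[-1]))
```

In this toy setting the learned weight is positive, so the reward keeps increasing past the best demonstration. A policy optimized against such a reward could in principle exceed the demonstrator, but the sketch only covers reward fitting, not the reinforcement learning step.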
