Learn to Exceed: Stereo Inverse Reinforcement Learning with Concurrent Policy Optimization

Abstract

In this paper, we study the problem of obtaining a control policy that can mimic and then outperform expert demonstrations in Markov decision processes where the reward function is unknown to the learning agent. One main relevant approach is inverse reinforcement learning (IRL), which mainly focuses on inferring a reward function from expert demonstrations. The control policy obtained by IRL and its associated algorithms, however, can hardly outperform the expert demonstrations. To overcome this limitation, we propose a novel method that enables the learning agent to outperform the demonstrator via a new concurrent reward and action policy learning approach. In particular, we first propose a new stereo utility definition that aims to address the bias in the interpretation of expert demonstrations. We then propose a loss function for the learning agent to learn reward and action policies concurrently such that the learning agent can outperform expert demonstrations. The performance of the proposed method is first demonstrated in OpenAI environments. Further efforts are conducted to experimentally validate the proposed method via an indoor drone flight scenario.
