A Joint Imitation-Reinforcement Learning Framework for Reduced Baseline Regret

Abstract

In various control task domains, existing controllers provide a baseline level of performance that, though possibly suboptimal, should be maintained. Reinforcement learning (RL) algorithms that rely on extensive exploration of the state and action space can be used to optimize a control policy. However, fully exploratory RL algorithms may decrease performance below the baseline level during training. In this paper, we address the problem of optimizing a control policy online while minimizing regret with respect to the performance of a baseline policy. We present a joint imitation-reinforcement learning framework, denoted JIRL. The learning process in JIRL assumes the availability of a baseline policy and is designed with two objectives in mind: (a) leveraging the baseline's online demonstrations to minimize the regret with respect to the baseline policy during training, and (b) eventually surpassing the baseline performance. JIRL addresses these objectives by initially learning to imitate the baseline policy and gradually shifting control from the baseline to an RL agent. Experimental results show that JIRL accomplishes both objectives in several continuous action-space domains. The results demonstrate that JIRL is comparable to a state-of-the-art algorithm in its final performance while incurring significantly lower baseline regret during training in all of the presented domains. Moreover, the results show a reduction in baseline regret by a factor of up to 21 relative to a state-of-the-art baseline regret minimization approach.
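The idea of gradually shifting control from the baseline to an RL agent can be illustrated with a minimal sketch. This is not the JIRL algorithm itself: the linear handover schedule, the function names (`agent_control_probability`, `select_action`, `baseline_regret`), the parameters `warmup_steps` and `handover_steps`, and the shortfall-based regret measure are all assumptions chosen for illustration, since the abstract does not specify how control is shared or how regret is computed.

```python
import random


def agent_control_probability(step, warmup_steps, handover_steps):
    """Hypothetical schedule for how often the RL agent acts.

    Returns 0.0 during an imitation-only warm-up phase, then ramps
    linearly to 1.0 over `handover_steps` (which must be positive).
    """
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / handover_steps)


def select_action(state, baseline_policy, agent_policy, p_agent, rng=random):
    """Let the agent act with probability `p_agent`, else the baseline.

    Returns (action, source) so the caller can log who was in control
    and use baseline actions as imitation targets.
    """
    if rng.random() < p_agent:
        return agent_policy(state), "agent"
    return baseline_policy(state), "baseline"


def baseline_regret(baseline_returns, training_returns):
    """Illustrative regret: cumulative shortfall below the baseline.

    Episodes that match or beat the baseline contribute zero. This is
    an assumed definition, not necessarily the one used in the paper.
    """
    return sum(max(0.0, b - r) for b, r in zip(baseline_returns, training_returns))
```

With `warmup_steps=100` and `handover_steps=200`, for example, the baseline keeps full control for the first 100 steps, shares control evenly with the agent at step 200, and hands over completely from step 300 onward.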
