Abstract
Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL) (Ho et al.), suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO (Schulman et al.). In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques, specifically double Q-network-based stabilization and value learning without reward function inference, we demonstrate a reduction in the samples required to robustly match expert behavior.