Backward Imitation and Forward Reinforcement Learning via Bi-directional Model Rollouts

Abstract

Traditional model-based reinforcement learning (RL) methods generate forward rollout traces using the learnt dynamics model to reduce interactions with the real environment. A recent model-based RL method additionally learns a backward model, which specifies the conditional probability of the previous state given the previous action and the current state, in order to generate backward rollout trajectories as well. However, in this type of model-based method, the samples derived from backward rollouts and those from forward rollouts are simply aggregated to optimize the policy via a model-free RL algorithm, which may decrease both the sample efficiency and the convergence rate. This is because such an approach ignores the fact that backward rollout traces are often generated starting from high-value states and are therefore more instructive for improving the agent's behavior. In this paper, we propose the backward imitation and forward reinforcement learning (BIFRL) framework, in which the agent treats backward rollout traces as expert demonstrations for imitating excellent behaviors, and then collects forward rollout transitions for policy reinforcement. Consequently, BIFRL enables the agent to both reach and explore from high-value states more efficiently, and further reduces real interactions, making it potentially more suitable for real-robot learning. Moreover, a value-regularized generative adversarial network is introduced to augment the valuable states that the agent encounters infrequently. Theoretically, we provide the condition under which BIFRL is superior to the baseline methods. Experimentally, we demonstrate that BIFRL achieves better sample efficiency and competitive asymptotic performance on various MuJoCo locomotion tasks compared with state-of-the-art model-based methods.
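
To make the distinction between the two rollout directions concrete, the following is a minimal sketch on a hypothetical 1-D chain environment, not the paper's implementation. The names `forward_model`, `backward_model`, `backward_rollout`, `forward_rollout`, `imitation_buffer`, and `rl_buffer` are illustrative assumptions, and the hand-written deterministic models stand in for the learnt probabilistic models. The sketch only shows that backward traces end at a chosen high-value state and can be stored separately from forward traces, rather than aggregated with them.

```python
import random

N_STATES = 10        # hypothetical chain: states 0..9 (assumption for illustration)
ACTIONS = (-1, +1)   # move left or right


def forward_model(s, a):
    """Toy stand-in for a learnt forward model: next state given (s, a)."""
    return min(max(s + a, 0), N_STATES - 1)


def backward_model(s_next, a):
    """Toy stand-in for a learnt backward model: previous state given the
    previous action a and the current state s_next (None if out of range)."""
    s_prev = s_next - a
    return s_prev if 0 <= s_prev < N_STATES else None


def backward_rollout(s_high, horizon, rng):
    """Roll backward from a high-value state; return transitions in time order."""
    trace, s_next = [], s_high
    for _ in range(horizon):
        # Keep only (previous state, action) pairs that the backward model allows.
        candidates = [(backward_model(s_next, a), a) for a in ACTIONS]
        candidates = [(s, a) for s, a in candidates if s is not None]
        s_prev, a = rng.choice(candidates)
        trace.append((s_prev, a, s_next))
        s_next = s_prev
    trace.reverse()  # earliest transition first, so the trace ends at s_high
    return trace


def forward_rollout(s_start, horizon, rng):
    """Roll forward from s_start with a random exploratory policy."""
    trace, s = [], s_start
    for _ in range(horizon):
        a = rng.choice(ACTIONS)
        s_next = forward_model(s, a)
        trace.append((s, a, s_next))
        s = s_next
    return trace


rng = random.Random(0)
s_high = N_STATES - 1                                # assumed high-value state
imitation_buffer = backward_rollout(s_high, 5, rng)  # treated as demonstrations
rl_buffer = forward_rollout(s_high, 5, rng)          # used for policy reinforcement
```

In this sketch, every transition in `imitation_buffer` is consistent with the forward dynamics and the final transition arrives at `s_high`, which is the property that makes such traces usable as demonstrations of how to reach valuable states.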
