Offline Experience Replay for Continual Offline Reinforcement Learning

Abstract

The ability to continually learn new skills from a sequence of pre-collected offline datasets is desirable for an agent. However, learning a sequence of offline tasks one after another is likely to cause catastrophic forgetting in resource-limited scenarios. In this paper, we formulate a new setting, continual offline reinforcement learning (CORL), in which an agent learns a sequence of offline reinforcement learning tasks and pursues good performance on all learned tasks with a small replay buffer, without exploring the environment of any task in the sequence. To learn consistently across all sequential tasks, an agent must acquire new knowledge while preserving old knowledge, all in an offline manner. To this end, we evaluate existing continual learning algorithms and find experimentally that experience replay (ER) is the most suitable algorithm for the CORL problem. However, we observe that introducing ER into CORL creates a new distribution shift problem: a mismatch between the experiences in the replay buffer and the trajectories generated by the learned policy. To address this issue, we propose a new model-based experience selection (MBES) scheme for building the replay buffer, in which a transition model is learned to approximate the state distribution. This model is used to bridge the distribution bias between the replay buffer and the learned model by selecting, from the offline data, the transitions that most closely resemble the learned model and storing them in the buffer. Moreover, to improve the ability to learn new tasks, we retrofit the experience replay method with a new dual behavior cloning (DBC) architecture that keeps the behavior-cloning loss from disturbing the Q-learning process. We call the overall algorithm offline experience replay (OER). Extensive experiments demonstrate that OER outperforms state-of-the-art (SOTA) baselines in widely used MuJoCo environments.
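The selection idea behind MBES can be illustrated with a minimal sketch. The abstract does not specify the procedure, so everything below is an assumption made for illustration: the function name select_experiences, the greedy nearest-neighbour rule, the Euclidean distance, and the callables policy and model (standing in for the learned policy and the learned transition model) are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np


def select_experiences(states, policy, model, start_state, budget):
    """Greedily pick dataset indices whose states track a model rollout.

    Hypothetical illustration only (not the authors' code):
    states      -- (N, d) array of states from the offline dataset
    policy      -- callable state -> action (stand-in for the learned policy)
    model       -- callable (state, action) -> predicted next state
    start_state -- (d,) array where the imagined rollout begins
    budget      -- maximum number of transitions to keep in the buffer
    """
    states = np.asarray(states, dtype=float)
    available = np.ones(len(states), dtype=bool)
    current = np.asarray(start_state, dtype=float)
    chosen = []
    for _ in range(min(budget, len(states))):
        # Ask the model where the learned policy would go next.
        predicted = model(current, policy(current))
        # Distance from every dataset state to the predicted state.
        dists = np.linalg.norm(states - predicted, axis=1)
        dists[~available] = np.inf  # never pick the same sample twice
        idx = int(np.argmin(dists))
        chosen.append(idx)
        available[idx] = False
        # Continue the rollout from the real (stored) state.
        current = states[idx]
    return np.array(chosen, dtype=int)
```

The returned indices would then be used to copy the corresponding transitions from the offline dataset into the replay buffer. Continuing the rollout from the stored state rather than the predicted one is a design choice of this sketch, intended to limit compounding model error.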
