Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Abstract

Conventional reinforcement learning (RL) needs an environment to collect fresh data, which is impractical when online interactions are costly. Offline RL provides an alternative by learning directly from a previously collected dataset. However, it yields unsatisfactory performance if the quality of the offline dataset is poor. In this paper, we consider an offline-to-online setting where the agent is first trained on the offline dataset and then trained online, and propose a framework called Adaptive Policy Learning for effectively taking advantage of both offline and online data. Specifically, we explicitly consider the difference between the online and offline data and apply an adaptive update scheme accordingly, that is, a pessimistic update strategy for the offline dataset and an optimistic/greedy update scheme for the online dataset. Such a simple and effective method provides a way to mix offline and online RL and achieve the best of both worlds. We further provide two detailed algorithms that implement the framework by embedding value-based or policy-based RL algorithms into it. Finally, we conduct extensive experiments on popular continuous control tasks, and results show that our algorithm can learn the expert policy with high sample efficiency even when the quality of the offline dataset is poor, e.g., a random dataset.
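
As a rough illustration of the adaptive update idea, the sketch below computes temporal-difference targets for a mixed batch, treating offline and online transitions differently. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function name adaptive_td_targets, the is_offline flag, and the constant penalty subtracted from offline targets are all hypothetical choices made here to stand in for "pessimistic" and "greedy" updates. The abstract does not specify which pessimistic regularizer the actual algorithms use.

```python
import numpy as np


def adaptive_td_targets(rewards, next_q, is_offline, gamma=0.99, penalty=1.0):
    """Return per-sample TD targets for a batch mixing offline and online data.

    Assumption (not from the paper): pessimism is modeled as a constant
    penalty subtracted from the target of every offline transition, while
    online transitions keep the ordinary greedy target.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q, dtype=float)
    is_offline = np.asarray(is_offline, dtype=bool)

    # Ordinary (greedy) one-step target: r + gamma * Q(s', a').
    greedy = rewards + gamma * next_q

    # Offline samples get the lowered target; online samples are unchanged.
    return np.where(is_offline, greedy - penalty, greedy)


# Two transitions with identical reward and next-state value:
# the first comes from the offline dataset, the second from online interaction.
targets = adaptive_td_targets(
    rewards=[1.0, 1.0],
    next_q=[2.0, 2.0],
    is_offline=[True, False],
    gamma=0.5,
    penalty=0.5,
)
```

With these inputs the offline transition receives a strictly lower target than the online one, which is the qualitative behavior the abstract describes. How strongly to penalize offline data is a design choice that this sketch leaves as a free parameter.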
