Online and Offline Reinforcement Learning by Planning with a Learned Model

Abstract

Learning efficiently from small amounts of data has long been the focus of model-based reinforcement learning, both for the online case when interacting with the environment and the offline case when learning from a fixed dataset. However, to date no single unified algorithm could demonstrate state-of-the-art results in both settings. In this work, we describe the Reanalyse algorithm which uses model-based policy and value improvement operators to compute new improved training targets on existing data points, allowing efficient learning for data budgets varying by several orders of magnitude. We further show that Reanalyse can also be used to learn entirely from demonstrations without any environment interactions, as in the case of offline Reinforcement Learning (offline RL). Combining Reanalyse with the MuZero algorithm, we introduce MuZero Unplugged, a single unified algorithm for any data budget, including offline RL. In contrast to previous work, our algorithm does not require any special adaptations for the off-policy or offline RL settings. MuZero Unplugged sets new state-of-the-art results in the RL Unplugged offline RL benchmark as well as in the online RL benchmark of Atari in the standard 200 million frame setting.
