MOReL: Model-Based Offline Reinforcement Learning

Abstract

In offline reinforcement learning (RL), the goal is to learn a successful policy using only a dataset of historical interactions with the environment, without any additional online interactions. This serves as an extreme test for an agent's ability to effectively use historical data, which is critical for efficient RL. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based RL in the offline setting. This framework consists of two steps: (a) learning a pessimistic MDP model using the offline dataset; (b) learning a near-optimal policy in the learned pessimistic MDP. The construction of the pessimistic MDP is such that for any policy, the performance in the real environment is lower bounded by the performance in the pessimistic MDP. This enables the pessimistic MDP to serve as a good surrogate for the purposes of policy evaluation and learning. Overall, MOReL is amenable to detailed theoretical analysis, enables easy and transparent design of practical algorithms, and leads to state-of-the-art results on widely studied offline RL benchmark tasks.
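
To make the two-step structure concrete, the following is a minimal tabular sketch and not the paper's implementation. The abstract does not say how the pessimistic MDP is built, so the sketch assumes one simple construction for illustration: any state-action pair seen fewer than `min_count` times in the dataset is redirected to a penalized absorbing state (called `halt` here). The function names, the count threshold, the `penalty` value, and the use of value iteration as the planner are all assumptions of this example.

```python
import numpy as np

def build_pessimistic_mdp(dataset, n_states, n_actions, min_count=1, penalty=1.0):
    """Step (a): estimate a tabular MDP from (s, a, r, s_next) tuples.

    Assumption for illustration: poorly covered (s, a) pairs transition to an
    extra absorbing state with reward -penalty.
    """
    halt = n_states  # index of the extra absorbing state
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in dataset:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)

    P = np.zeros((n_states + 1, n_actions, n_states + 1))
    R = np.zeros((n_states + 1, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            if visits[s, a] >= min_count:
                # Well-covered pair: use empirical transitions and mean reward.
                P[s, a, :n_states] = counts[s, a] / visits[s, a]
                R[s, a] = reward_sum[s, a] / visits[s, a]
            else:
                # Poorly covered pair: be pessimistic.
                P[s, a, halt] = 1.0
                R[s, a] = -penalty
    # The absorbing state loops on itself and keeps paying the penalty.
    P[halt, :, halt] = 1.0
    R[halt, :] = -penalty
    return P, R

def value_iteration(P, R, gamma=0.9, n_iters=500):
    """Step (b): plan in the pessimistic MDP; returns a greedy policy and values."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        V = (R + gamma * (P @ V)).max(axis=1)
    Q = R + gamma * (P @ V)
    return Q.argmax(axis=1), Q.max(axis=1)

# Toy dataset: only action 0 was ever tried, in both states.
dataset = [(0, 0, 1.0, 1), (1, 0, 1.0, 0)]
P, R = build_pessimistic_mdp(dataset, n_states=2, n_actions=2)
policy, V = value_iteration(P, R)
```

In this toy example the planner prefers the action that the data covers, because the untried action leads to the penalized state. Practical offline RL benchmarks have continuous states and actions, where counting visits is not applicable and a learned model with some measure of uncertainty would be needed instead.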
