Revisiting Design Choices in Model-Based Offline Reinforcement Learning

Abstract

Offline reinforcement learning enables agents to leverage large pre-collected datasets of environment transitions to learn control policies, circumventing the need for potentially expensive or unsafe online data collection. Significant progress has been made recently in offline model-based reinforcement learning, approaches which leverage a learned dynamics model. This typically involves constructing a probabilistic model, and using the model uncertainty to penalize rewards where there is insufficient data, solving for a pessimistic MDP that lower bounds the true MDP. Existing methods, however, exhibit a breakdown between theory and practice, whereby pessimistic return ought to be bounded by the total variation distance of the model from the true dynamics, but is instead implemented through a penalty based on estimated model uncertainty. This has spawned a variety of uncertainty heuristics, with little to no comparison between differing approaches. In this paper, we compare these heuristics, and design novel protocols to investigate their interaction with other hyperparameters, such as the number of models, or imaginary rollout horizon. Using these insights, we show that selecting these key hyperparameters using Bayesian Optimization produces superior configurations that are vastly different to those currently used in existing hand-tuned state-of-the-art methods, and result in drastically stronger performance.
