Overcoming Model Bias for Robust Offline Deep Reinforcement Learning

Abstract

State-of-the-art reinforcement learning algorithms mostly rely on being allowed to interact directly with their environment to collect millions of observations. This makes it hard to transfer their success to industrial control problems, where simulations are often very costly or do not exist, and where exploring in the real environment can lead to catastrophic events. Recently developed model-free offline algorithms can learn from a single dataset by mitigating extrapolation error in value functions. However, the robustness of the training process is still comparatively low, a problem known from methods using value functions. To improve the robustness and stability of the learning process, we use dynamics models to assess policy performance instead of value functions, resulting in MOOSE (MOdel-based Offline policy Search with Ensembles), an algorithm which ensures low model bias by keeping the policy within the support of the data. We compare MOOSE with the state-of-the-art model-free offline RL algorithms BEAR and BCQ on the Industrial Benchmark and MuJoCo continuous control tasks in terms of robust performance, and find that MOOSE outperforms its model-free counterparts in almost all considered cases, often by a wide margin.
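
The general idea of scoring a policy with an ensemble of dynamics models while discouraging it from leaving the data support can be illustrated with a toy sketch. This is not the MOOSE implementation: the abstract does not specify the objective, so the pessimistic minimum over ensemble members, the squared-distance penalty, the weight `lam`, and the one-dimensional linear models below are all assumptions chosen for illustration.

```python
import numpy as np

def ensemble_return(policy, models, reward_fn, start_states, horizon=10, gamma=0.99):
    """Roll the policy out through each model; return one mean discounted return per model."""
    returns = []
    for model in models:
        s = np.array(start_states, dtype=float)
        total = np.zeros(len(s))
        for t in range(horizon):
            a = policy(s)
            s_next = model(s, a)
            total += (gamma ** t) * reward_fn(s, a, s_next)
            s = s_next
        returns.append(total.mean())
    return np.array(returns)

def support_penalty(policy, data_states, data_actions):
    """Assumed penalty: mean squared gap between policy actions and dataset actions."""
    return float(np.mean((policy(data_states) - data_actions) ** 2))

def objective(policy, models, reward_fn, data_states, data_actions, lam=1.0, horizon=10):
    """Assumed objective: worst-case ensemble return minus a weighted support penalty."""
    est = ensemble_return(policy, models, reward_fn, data_states, horizon)
    return float(est.min()) - lam * support_penalty(policy, data_states, data_actions)

# Hypothetical toy setup: 1-D states, a linear behavior policy, three linear models.
rng = np.random.default_rng(0)
data_states = rng.normal(size=(64, 1))
behavior = lambda s: -0.5 * s
data_actions = behavior(data_states)
models = [lambda s, a, k=k: s + k * a for k in (0.9, 1.0, 1.1)]
reward_fn = lambda s, a, s_next: -(s_next ** 2).sum(axis=1)
aggressive = lambda s: -1.5 * s  # deviates from the actions seen in the data

print(objective(behavior, models, reward_fn, data_states, data_actions))
print(objective(aggressive, models, reward_fn, data_states, data_actions))
```

In this sketch the behavior policy incurs no penalty because it reproduces the dataset actions exactly, while the `aggressive` policy is penalized in proportion to `lam`. Taking the minimum over models is one simple way to be cautious about model disagreement; other aggregations would fit the same structure.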
