Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning

Abstract

Offline reinforcement learning (RL) is a powerful approach for data-driven decision-making and control. Compared to model-free methods, offline model-based reinforcement learning (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving the data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, there could be various MDPs that behave identically on the offline dataset and so dealing with the uncertainty about the true MDP can be challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), which is a principled framework for addressing model uncertainty. We further introduce a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search and can be integrated into offline MBRL as a policy improvement operator in policy iteration. Our "RL + Search" framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by incorporating more computation input. The proposed algorithm significantly outperforms state-of-the-art model-based and model-free offline RL methods on twelve D4RL MuJoCo benchmark tasks and three target tracking tasks in a challenging, stochastic tokamak control simulator.
