Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning

Abstract

Model-based reinforcement learning algorithms with probabilistic dynamical models are amongst the most data-efficient learning methods. This is often attributed to their ability to distinguish between epistemic and aleatoric uncertainty. However, while most algorithms distinguish these two uncertainties for {\em learning} the model, they ignore the distinction when {\em optimizing} the policy. In this paper, we show that ignoring the epistemic uncertainty leads to greedy algorithms that do not explore sufficiently. In turn, we propose a {\em practical optimistic-exploration algorithm} (\alg), which enlarges the input space with {\em hallucinated} inputs that can exert as much control as the {\em epistemic} uncertainty in the model affords. We analyze this setting and construct a general regret bound for well-calibrated models, which is provably sublinear in the case of Gaussian Process models. Based on this theoretical foundation, we show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms and different probabilistic models. Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.
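
As a rough illustration of the hallucinated-input idea, the sketch below shows a single optimistic transition in which an extra input eta, bounded in [-1, 1], selects the next state from within the model's epistemic confidence region. It is a minimal sketch under stated assumptions, not the paper's implementation: the names hallucinated_step, toy_mean and toy_std, the scaling factor beta, and the toy linear model are all hypothetical and chosen only for illustration.

```python
import numpy as np


def hallucinated_step(mean_fn, std_fn, state, action, eta, beta=1.0):
    """One optimistic transition (illustrative sketch).

    mean_fn / std_fn: hypothetical callables returning the model's predictive
    mean and epistemic standard deviation for a state-action pair.
    eta: hallucinated input, clipped to [-1, 1] per state dimension.
    beta: assumed scaling of the confidence region.
    """
    eta = np.clip(eta, -1.0, 1.0)  # hallucinated control is bounded
    mu = mean_fn(state, action)
    sigma = std_fn(state, action)
    # The hallucinated input moves the prediction inside mu +/- beta * sigma.
    return mu + beta * sigma * eta


# Toy model, assumed purely for illustration.
def toy_mean(s, a):
    return 0.9 * s + 0.1 * a


def toy_std(s, a):
    return 0.05 * np.ones_like(s)


s = np.array([1.0, -1.0])
a = np.array([0.5, 0.5])

# eta = 0 recovers the greedy (mean) prediction.
greedy = hallucinated_step(toy_mean, toy_std, s, a, eta=np.zeros(2))
# eta = 1 pushes every state dimension to the upper confidence bound.
optimistic = hallucinated_step(toy_mean, toy_std, s, a, eta=np.ones(2), beta=2.0)
```

In this sketch a planner would optimize over both the real action and eta; where the epistemic uncertainty is zero, eta has no effect and the step reduces to the mean prediction.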
