Plan To Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning

Abstract

In Model-based Reinforcement Learning (MBRL), model learning is critical, since an inaccurate model can bias policy learning by generating misleading samples. However, learning an accurate model can be difficult, since the policy is continually updated and the induced distribution over visited states used for model learning shifts accordingly. Prior methods alleviate this issue by quantifying the uncertainty of model-generated samples. However, these methods only quantify the uncertainty passively, after the samples have been generated, rather than foreseeing the uncertainty before model trajectories fall into those highly uncertain regions. The resulting low-quality samples can induce unstable learning targets and hinder the optimization of the policy. Moreover, although the model is trained to minimize one-step prediction errors, it is generally used to predict multiple steps ahead, leading to a mismatch between the objectives of model learning and model usage. To this end, we propose Plan To Predict (P2P), an MBRL framework that treats the model rollout process as a sequential decision-making problem by reversing the usual roles: the model is regarded as the decision maker and the current policy as the dynamics. In this way, the model can quickly adapt to the current policy and foresee the multi-step future uncertainty when generating trajectories. Theoretically, we show that the performance of P2P can be guaranteed by approximately optimizing a lower bound of the true environment return. Empirical results demonstrate that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
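The mismatch between one-step model learning and multi-step model usage can be illustrated with a toy example. The sketch below is not the P2P algorithm; it assumes a hypothetical scalar linear environment and a learned model whose gain is off by 2% (the names `TRUE_GAIN`, `MODEL_GAIN`, `rollout`, and `multi_step_errors` are invented for illustration). It shows only that a small one-step prediction error can compound over a multi-step rollout.

```python
# Toy illustration (assumed setup, not from the paper): a model with a small
# one-step error is rolled out for many steps and compared with the true dynamics.

TRUE_GAIN = 1.00   # hypothetical true dynamics: s' = TRUE_GAIN * s
MODEL_GAIN = 1.02  # hypothetical learned model: s' = MODEL_GAIN * s


def true_step(state):
    """One step of the (assumed) true environment."""
    return TRUE_GAIN * state


def model_step(state):
    """One step of the (assumed) slightly inaccurate learned model."""
    return MODEL_GAIN * state


def rollout(step_fn, initial_state, horizon):
    """Apply step_fn repeatedly; returns horizon + 1 states, including the start."""
    states = [initial_state]
    for _ in range(horizon):
        states.append(step_fn(states[-1]))
    return states


def multi_step_errors(initial_state, horizon):
    """Absolute gap between model and true rollouts at each step."""
    true_states = rollout(true_step, initial_state, horizon)
    model_states = rollout(model_step, initial_state, horizon)
    return [abs(m - t) for m, t in zip(model_states, true_states)]


errors = multi_step_errors(initial_state=1.0, horizon=20)
print("one-step error:", errors[1])
print("20-step error: ", errors[20])
```

In this toy setting the error after 20 steps is more than ten times the one-step error, even though the model would look accurate under a one-step training loss. A model trained with the multi-step rollout in mind, as the abstract proposes, targets exactly this accumulated quantity rather than the per-step one.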
