Abstract
Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing a complexity measure of the model. However, this complexity might grow exponentially even for the simplest nonlinear models, where global convergence is impossible within a finite number of iterations. When the model suffers a large generalization error, which is quantitatively measured by the model complexity, the uncertainty can be large. The sampled model on which the current policy is greedily optimized will thus be unsettled, resulting in aggressive policy updates and over-exploration. In this work, we propose Conservative Dual Policy Optimization (CDPO), which involves a Referential Update and a Conservative Update. The policy is first optimized under a reference model, which imitates the mechanism of PSRL while offering more stability. A conservative range of randomness is guaranteed by maximizing the expectation of model value. Without harmful sampling procedures, CDPO can still achieve the same regret as PSRL. More importantly, CDPO enjoys monotonic policy improvement and global optimality simultaneously. Empirical results also validate the exploration efficiency of CDPO.