Abstract
Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamic models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.
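The interplay between the one-step adaptation and the meta-update can be illustrated with a deliberately small toy sketch. It is not the MB-MPO implementation: each learned dynamics model is reduced to a hypothetical scalar target `c_k`, the policy to a single parameter `theta`, and the return under model `k` is assumed to be `-(theta - c_k)**2`, so gradients are available in closed form instead of being estimated from imagined rollouts. All function names and step sizes below are illustrative assumptions.

```python
# Toy sketch (assumption): model k is summarised by a scalar target c_k, and
# the return of policy parameter theta under model k is -(theta - c_k)**2.

def model_return(theta, c):
    """Return of the scalar policy parameter under one toy model."""
    return -(theta - c) ** 2


def return_grad(theta, c):
    """Analytic gradient of model_return with respect to theta."""
    return -2.0 * (theta - c)


def adapt(theta, c, inner_lr):
    """Adaptation step: one gradient-ascent step on a single model's return."""
    return theta + inner_lr * return_grad(theta, c)


def meta_gradient(theta, targets, inner_lr):
    """Gradient of the mean post-adaptation return with respect to theta."""
    total = 0.0
    for c in targets:
        adapted = adapt(theta, c, inner_lr)
        # Chain rule: d(adapted)/d(theta) = 1 - 2 * inner_lr for this quadratic.
        total += return_grad(adapted, c) * (1.0 - 2.0 * inner_lr)
    return total / len(targets)


def meta_train(targets, inner_lr=0.1, outer_lr=0.1, steps=500, theta=0.0):
    """Outer loop: gradient ascent on the post-adaptation objective."""
    for _ in range(steps):
        theta += outer_lr * meta_gradient(theta, targets, inner_lr)
    return theta


# Hypothetical ensemble whose members disagree slightly about the optimum.
ensemble_targets = [0.8, 1.0, 1.3]
meta_theta = meta_train(ensemble_targets)
adapted_thetas = [adapt(meta_theta, c, 0.1) for c in ensemble_targets]
```

In this toy setting the meta-parameter settles where the ensemble members agree on average, and a single adaptation step then moves it toward each member's own optimum. That mirrors, in miniature, the division of labor described above between the meta-policy and the adaptation step.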