Abstract
Meta-reinforcement learning (Meta-RL) has attracted attention due to its capability to enhance reinforcement learning (RL) algorithms, in terms of data efficiency and generalizability. In this paper, we develop a bilevel optimization framework for meta-RL (BO-MRL) to learn the meta-prior for task-specific policy adaptation, which implements multiple-step policy optimization on one-time data collection. Beyond existing meta-RL analyses, we provide upper bounds of the expected optimality gap over the task distribution. This metric measures the distance of the policy adaptation from the learned meta-prior to the task-specific optimum, and quantifies the model's generalizability to the task distribution. We empirically validate the correctness of the derived upper bounds and demonstrate the superior effectiveness of the proposed algorithm over benchmarks.