Offline Meta-Reinforcement Learning with Advantage Weighting

Abstract

This paper introduces the offline meta-reinforcement learning (offline meta-RL) problem setting and proposes an algorithm that performs well in this setting. Offline meta-RL is analogous to the widely successful supervised learning strategy of pre-training a model on a large batch of fixed, pre-collected data (possibly from various tasks) and fine-tuning the model to a new task with relatively little data. That is, in offline meta-RL, we meta-train on fixed, pre-collected data from several tasks in order to adapt to a new task with a very small amount (less than 5 trajectories) of data from the new task. By nature of being offline, algorithms for offline meta-RL can utilize the largest possible pool of training data available and eliminate potentially unsafe or costly data collection during meta-training. This setting inherits the challenges of offline RL, but it differs significantly because offline RL does not generally consider a) transfer to new tasks or b) limited data from the test task, both of which we face in offline meta-RL. Targeting the offline meta-RL setting, we propose Meta-Actor Critic with Advantage Weighting (MACAW), an optimization-based meta-learning algorithm that uses simple, supervised regression objectives for both the inner and outer loop of meta-training. On offline variants of common meta-RL benchmarks, we empirically find that this approach enables fully offline meta-reinforcement learning and achieves notable gains over prior methods.
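
The abstract names advantage weighting and supervised regression objectives but does not spell out a loss. As a rough illustration only, the sketch below shows a generic advantage-weighted regression loss: a negative log-likelihood over logged actions in which each action is weighted by the exponential of its estimated advantage. It is not MACAW's actual inner- or outer-loop objective. The function name `advantage_weighted_loss` and the parameters `temperature` and `max_weight` are hypothetical choices made for this example.

```python
import math


def advantage_weighted_loss(log_probs, advantages, temperature=1.0, max_weight=20.0):
    """Generic advantage-weighted regression loss (illustrative sketch only).

    log_probs:   log pi(a_i | s_i) for each logged action, as plain floats.
    advantages:  estimated advantage A(s_i, a_i) for each logged action.
    temperature: hypothetical scale; smaller values sharpen the weighting.
    max_weight:  hypothetical cap on each weight, for numerical stability.
    """
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same length")
    if not log_probs:
        raise ValueError("at least one sample is required")

    # Clip the exponent (not the result) so math.exp can never overflow.
    max_exponent = math.log(max_weight)
    weights = [math.exp(min(a / temperature, max_exponent)) for a in advantages]

    # Weighted negative log-likelihood, averaged over the batch.
    total = sum(w * lp for w, lp in zip(weights, log_probs))
    return -total / len(log_probs)
```

Because the weights depend only on logged data, minimizing this loss is an ordinary weighted supervised regression problem and needs no further interaction with the environment, which is the property the abstract emphasizes for the offline setting.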
