Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds

Abstract

In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on the fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose AnyMDP, a family of procedurally generated tabular Markov Decision Processes. Through a carefully designed randomization process, AnyMDP can generate high-quality tasks at large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and incorporate prior information into the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities: it highlights the necessity of diverse and extensive task design and of prioritizing asymptotic performance over few-shot adaptation.
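To make the notion of a "procedurally generated tabular MDP" concrete, the sketch below samples one random tabular task and steps through it. It is a minimal illustration under stated assumptions, not AnyMDP's randomization process, which the abstract does not specify: the function names `sample_tabular_mdp` and `step`, the Dirichlet-distributed transition rows (drawn as normalized Gamma samples), and the Gaussian reward table are all hypothetical choices made only for this example.

```python
import random


def sample_tabular_mdp(n_states, n_actions, concentration=1.0, seed=None):
    """Sample a random tabular MDP (hypothetical illustration, NOT AnyMDP's generator).

    Returns (P, R): P[s][a] is a probability distribution over next states,
    R[s][a] is the reward for taking action a in state s.
    """
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        P_s, R_s = [], []
        for _ in range(n_actions):
            # Dirichlet(concentration, ..., concentration) sample via normalized Gamma draws
            weights = [rng.gammavariate(concentration, 1.0) for _ in range(n_states)]
            total = sum(weights)
            P_s.append([w / total for w in weights])
            # Assumed reward model: one standard-normal scalar per (state, action) pair
            R_s.append(rng.gauss(0.0, 1.0))
        P.append(P_s)
        R.append(R_s)
    return P, R


def step(P, R, state, action, rng):
    """Sample one transition; an in-context learner would observe (state, action, reward, next_state)."""
    next_state = rng.choices(range(len(P)), weights=P[state][action], k=1)[0]
    return R[state][action], next_state


if __name__ == "__main__":
    # Usage: draw one task, then collect a short trajectory under a uniformly random policy
    P, R = sample_tabular_mdp(n_states=8, n_actions=4, seed=0)
    rng = random.Random(1)
    state = 0
    for t in range(5):
        action = rng.randrange(4)
        reward, next_state = step(P, R, state, action, rng)
        print(t, state, action, round(reward, 3), next_state)
        state = next_state
```

Because each call with a new seed yields a fresh task, a generator of this kind can in principle supply an unbounded stream of training tasks; how AnyMDP shapes the sampling to reduce structural bias is described in the paper itself, not here.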
