Abstract
Language models trained on diverse datasets unlock generalization by in-context learning. Reinforcement Learning (RL) policies can achieve a similar effect by meta-learning within the memory of a sequence model. However, meta-RL research primarily focuses on adapting to minor variations of a single task. It is difficult to scale towards more general behavior without confronting challenges in multi-task optimization, and few solutions are compatible with meta-RL's goal of learning from large training sets of unlabeled tasks. To address this challenge, we revisit the idea that multi-task RL is bottlenecked by imbalanced training losses created by uneven return scales across different tasks. We build upon recent advancements in Transformer-based (in-context) meta-RL and evaluate a simple yet scalable solution where both an agent's actor and critic objectives are converted to classification terms that decouple optimization from the current scale of returns. Large-scale comparisons in Meta-World ML45, Multi-Game Procgen, Multi-Task POPGym, Multi-Game Atari, and BabyAI find that this design unlocks significant progress in online multi-task adaptation and memory problems without explicit task labels.
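To make the idea of a classification-style critic objective concrete, the sketch below shows one common way to recast scalar value regression as classification: a "two-hot" encoding of the return over a fixed set of bins, trained with cross-entropy. This is an illustrative assumption, not necessarily the exact scheme evaluated here; the function names `two_hot` and `cross_entropy`, the bin range, and the bin count are all hypothetical choices, and the actor side is not sketched.

```python
import numpy as np


def two_hot(value, bins):
    """Encode a scalar as a distribution over `bins` (sorted bin centers).

    Probability mass is split between the two nearest bin centers so that
    the expectation of the encoding recovers the (clipped) scalar.
    """
    value = float(np.clip(value, bins[0], bins[-1]))
    target = np.zeros(len(bins))
    upper = int(np.searchsorted(bins, value, side="left"))
    if upper == 0:
        # value sits exactly on the lowest bin center
        target[0] = 1.0
        return target
    lower = upper - 1
    w_upper = (value - bins[lower]) / (bins[upper] - bins[lower])
    target[lower] = 1.0 - w_upper
    target[upper] = w_upper
    return target


def cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    shifted = logits - logits.max()  # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())


# Hypothetical setup: 21 bin centers spanning returns in [-10, 10].
bins = np.linspace(-10.0, 10.0, 21)
target = two_hot(2.5, bins)
decoded = float(target @ bins)  # expectation over bins recovers the scalar

# A critic head would output one logit per bin; uniform logits stand in here.
logits = np.zeros(len(bins))
loss = cross_entropy(logits, target)
```

Because the loss is a cross-entropy over a fixed number of bins, its magnitude depends on how well the predicted distribution matches the target rather than on the raw size of the return, which is the kind of decoupling from return scale described above.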