Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents

Abstract

While Reinforcement Learning (RL) has achieved remarkable success in language modeling, its triumph hasn't yet fully translated to visuomotor agents. A primary challenge in RL models is their tendency to overfit specific tasks or environments, thereby hindering the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training, and we construct an efficient distributed RL framework to support this. Experimental results show RL significantly boosts interaction success rates by $4\times$ and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the immense potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for significantly advancing visuomotor agents' spatial reasoning.
