From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

Abstract

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, trained solely on offline data, often lack the capacity to reason about the consequences of their actions or to adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) framework to scale the capability of navigation foundation models with reinforcement learning (RL). S2E combines the strengths of pre-training on videos and post-training through RL. It maintains the generalizability acquired from large-scale real-world videos while enhancing its interactivity through RL in simulation environments. Specifically, we introduce two innovations: an Anchor-Guided Distribution Matching strategy, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and a Residual-Attention Module, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3DGS reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models. Extensive experiments show that S2E mitigates the diminishing returns often seen when scaling with offline data alone. We perform a thorough analysis of the benefits of reinforcement learning compared to supervised fine-tuning in the context of post-training for robot learning. Our findings emphasize the crucial role of integrating interactive online experiences to effectively scale foundation models in robotics.
