From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment

Abstract

While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from low-level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open-world reasoning capabilities. However, off-the-shelf VLMs struggle to understand the consequences of low-level robot actions, as these are represented fundamentally differently from the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework to unlock the potential of VLMs as open-vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM's burden of predicting action outcomes (foresight) from evaluation (forethought). For foresight, we leverage a latent world model to imagine future latent states given diverse low-level action plans. For forethought, we align the VLM with these predicted latent states to reason about the consequences of actions in its native representation--natural language--and effectively filter proposed plans. We validate our framework across diverse robotic manipulation tasks, demonstrating its ability to bridge representational gaps and provide robust, generalizable policy steering. Videos can be found on the project website: https://yilin-wu98.github.io/forewarn/.
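
The foresight/forethought split described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`steer`, `propose`, `imagine`, `describe`, `choose`) are hypothetical, and the generative policy, latent world model, and aligned VLM are replaced by toy stand-ins so that the control flow is runnable on its own.

```python
from typing import Callable, List, Sequence


def steer(
    obs: float,
    propose: Callable[[float, int], Sequence[List[float]]],
    imagine: Callable[[float, List[float]], float],
    describe: Callable[[float], str],
    choose: Callable[[Sequence[str]], int],
    num_plans: int = 4,
) -> List[float]:
    """Pick one action plan by imagining and then verbally evaluating outcomes.

    All four callables are hypothetical placeholders:
      propose  - stands in for the generative policy (samples candidate plans)
      imagine  - stands in for the latent world model (foresight)
      describe - stands in for the VLM decoding a latent into language
      choose   - stands in for the VLM selecting among described outcomes
    """
    plans = propose(obs, num_plans)                # candidate low-level plans
    futures = [imagine(obs, p) for p in plans]     # foresight: predicted latents
    narrations = [describe(z) for z in futures]    # latents rendered as text
    best = choose(narrations)                      # forethought: evaluate in language
    return plans[best]


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_propose(obs: float, k: int) -> List[List[float]]:
    # Plan i repeats the scalar action i three times.
    return [[float(i)] * 3 for i in range(k)]


def toy_imagine(obs: float, plan: List[float]) -> float:
    # A one-dimensional "latent": the state after summing all actions.
    return obs + sum(plan)


def toy_describe(z: float) -> str:
    if z < 2:
        return "the robot does not reach the cup"
    if z <= 5:
        return "the robot grasps the cup by the handle"
    return "the robot knocks the cup over"


def toy_choose(narrations: Sequence[str]) -> int:
    # Prefer the first outcome that describes a successful grasp.
    for i, text in enumerate(narrations):
        if "grasps" in text:
            return i
    return 0


selected = steer(0.0, toy_propose, toy_imagine, toy_describe, toy_choose)
```

With these stand-ins, the candidate plans lead to imagined states 0, 3, 6, and 9, so only the second plan is described as a successful grasp and is the one selected. The point of the structure is that `choose` never sees raw actions, only language descriptions of their predicted consequences.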
