Grounding Video Models to Actions through Goal Conditioned Exploration

Abstract

Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamics model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and the model is limited to visual settings similar to those in which data are available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment -- using generated video states as visual goals for exploration. We propose a framework that uses trajectory-level action generation in combination with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. We show that our approach is on par with, or even surpasses, multiple behavior cloning baselines trained on expert demonstrations, without requiring any action annotations.
