3D-VLA: A 3D Vision-Language-Action Generative World Model

Abstract

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.
