A Survey on Vision-Language-Action Models for Embodied AI

Abstract

Embodied AI is widely recognized as a key element of artificial general intelligence because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models and vision-language models, a new category of multimodal models, referred to as vision-language-action models (VLAs), has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. In recent years, a myriad of VLAs have been developed, making it imperative to capture the rapidly evolving landscape through a comprehensive survey. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges faced by VLAs and outline promising future directions in embodied AI. We have created a project associated with this survey, which is available at https://github.com/yueen-ma/Awesome-VLA.
