Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models

Abstract

Foundation Vision Language Models (VLMs) exhibit strong capabilities in multi-modal representation learning, comprehension, and reasoning. By injecting action components into VLMs, Vision-Language-Action Models (VLAs) can be naturally formed, and they show promising performance. Existing work has demonstrated the effectiveness and generalization of VLAs across multiple scenarios and tasks. Nevertheless, the transfer from VLMs to VLAs is not trivial, since existing VLAs differ in their backbones, action-prediction formulations, data distributions, and training recipes. As a result, a systematic understanding of the design choices of VLAs is still missing. In this work, we disclose the key factors that significantly influence the performance of VLAs and focus on answering three essential design questions: which backbone to select, how to formulate the VLA architecture, and when to add cross-embodiment data. The results firmly convince us of why VLAs are needed and lead us to develop a new family of VLAs, RoboVLMs, which require very few manual designs and achieve new state-of-the-art performance in three simulation tasks and real-world experiments. Through our extensive experiments, which include over 8 VLM backbones, 4 policy architectures, and over 600 distinctly designed experiments, we provide a detailed guidebook for the future design of VLAs. In addition to the study, the highly flexible RoboVLMs framework, which supports easy integration of new VLMs and free combination of various design choices, is made public to facilitate future research. We open-source all details, including code, models, datasets, and toolkits, along with detailed training and evaluation recipes at: robovlms.github.io.
