HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

  • 2025-03-13 18:59:52
  • Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao, KC alex Zhou, Pheng-Ann Heng, Shanghang Zhang

Abstract

Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.

 
