Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models

  • 2025-05-29 18:59:46
  • Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, Leichen Wang, Xingtao Hu, Hao Sun, Hang Zhao, Hao Zhao

Abstract

Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips sourced from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks--improving closed-loop NeuroNCAP scores and collision rates, and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.

 
