Abstract
Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner-case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips drawn from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks: improved closed-loop NeuroNCAP scores and collision rates, and near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.