DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving

Abstract

Research interest in end-to-end autonomous driving has surged owing to its fully differentiable design integrating modular tasks, i.e. perception, prediction and planning, which enables optimization in pursuit of the ultimate goal. Despite the great potential of the end-to-end paradigm, existing methods suffer from several limitations, including expensive BEV (bird's-eye view) computation, limited action diversity, and sub-optimal decisions in complex real-world scenarios. To address these challenges, we propose a novel hybrid sparse-dense diffusion policy, empowered by a Vision-Language Model (VLM), called Diff-VLA. We explore the sparse diffusion representation for efficient multi-modal driving behavior. Moreover, we rethink the effectiveness of VLM driving decisions and improve the trajectory generation guidance through deep interaction across agents, map instances and VLM output. Our method shows superior performance in the Autonomous Grand Challenge 2025, which contains challenging real and reactive synthetic scenarios. Our method achieves 45.0 PDMS.
