Abstract
Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose the CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purposes.