DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

Abstract

While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1.
