OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning

Abstract

Advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents that leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging, since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering (VQA) dataset that challenges a model's true 3D situational awareness with comprehensive VQA tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.
