Abstract
We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models by representing all non-sensor inputs (e.g., navigation instructions and ego vehicle status) and outputs (e.g., trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space and to generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small number of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar, and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.