Abstract
End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data but still struggles in complex, rare scenarios due to limited commonsense. In contrast, Large Vision-Language Models (LVLMs) excel at scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods that use LVLMs to predict trajectories or control signals yield suboptimal results, as LVLMs are not well suited to precise numerical prediction. This paper presents Senna, an autonomous driving system that combines an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction: Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM uses a multi-image encoding approach and multi-view prompts for efficient scene understanding. In addition, we introduce planning-oriented QAs alongside a three-stage training strategy, which enhances Senna-VLM's planning performance while preserving commonsense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on the large-scale dataset DriveX and fine-tuning on nuScenes, Senna reduces average planning error by 27.12% and collision rate by 33.33% compared with the same model without pre-training. We believe Senna's cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at https://github.com/hustvl/Senna.
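
To make the decoupling concrete, the following is a minimal illustrative sketch, not the released implementation. The function names (`vlm_decide`, `e2e_plan`, `drive`), the decision strings, and the constant-deceleration motion model are all hypothetical stand-ins chosen for illustration. The only point is the interface: the language-model side emits a natural-language decision, and the numeric side turns that decision into waypoints.

```python
from typing import List, Tuple

# (x, y) in metres in the ego frame; this convention is an assumption of the sketch.
Waypoint = Tuple[float, float]


def vlm_decide(scene_description: str) -> str:
    """Hypothetical stand-in for Senna-VLM: returns a natural-language decision.

    A real LVLM would reason over multi-view images; this stub only checks a keyword.
    """
    if "pedestrian" in scene_description:
        return "decelerate, go straight"
    return "keep speed, go straight"


def e2e_plan(decision: str, current_speed: float,
             horizon: int = 3, dt: float = 1.0) -> List[Waypoint]:
    """Hypothetical stand-in for Senna-E2E: maps a decision to numeric waypoints.

    Uses a toy constant-acceleration model in place of a learned planner.
    """
    accel = -1.0 if "decelerate" in decision else 0.0  # m/s^2, illustrative value
    x, v = 0.0, current_speed
    waypoints: List[Waypoint] = []
    for _ in range(horizon):
        v = max(0.0, v + accel * dt)  # speed never goes negative
        x += v * dt
        waypoints.append((x, 0.0))    # straight-line motion only
    return waypoints


def drive(scene_description: str, current_speed: float) -> Tuple[str, List[Waypoint]]:
    """High-level decision first, precise trajectory second."""
    decision = vlm_decide(scene_description)
    return decision, e2e_plan(decision, current_speed)
```

In this sketch the language side never outputs a number, which reflects the stated motivation that LVLMs are poorly suited to precise numerical prediction.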