SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

Abstract

Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding, and achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to vision-language understanding is visual question answering. For autonomous driving, however, this is only useful if it is aligned with the action space. Otherwise, the model's answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision language model (VLM) and works using only cameras, excluding expensive sensors like LiDAR. SimLingo obtains state-of-the-art performance on the widely used CARLA simulator on the Bench2Drive benchmark and is the winning entry at the CARLA Challenge 2024. Additionally, we achieve strong results in a wide variety of language-related tasks while maintaining high driving performance.
