Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving

Abstract

Talk2BEV is a large vision-language model (LVLM) interface for bird's-eye view (BEV) maps in autonomous driving contexts. While existing perception systems for autonomous driving scenarios have largely focused on a pre-defined (closed) set of object categories and driving scenarios, Talk2BEV blends recent advances in general-purpose language and vision models with BEV-structured map representations, eliminating the need for task-specific models. This enables a single system to cater to a variety of autonomous driving tasks encompassing visual and spatial reasoning, predicting the intents of traffic actors, and decision-making based on visual cues. We extensively evaluate Talk2BEV on a large number of scene understanding tasks that rely on both the ability to interpret free-form natural language queries and the ability to ground these queries in the visual context embedded into the language-enhanced BEV map. To enable further research in LVLMs for autonomous driving scenarios, we develop and release Talk2BEV-Bench, a benchmark encompassing 1000 human-annotated BEV scenarios, with more than 20,000 questions and ground-truth responses from the NuScenes dataset.
