Abstract
We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios, and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.