Abstract
Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in video contexts remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation of LVLMs. Our work differs from existing video benchmarks in the following key features: 1) Knowledge required: questions demand the integration of external knowledge beyond the explicit narrative; 2) Fact-seeking questions: questions target objective, undisputed events or relationships and avoid subjective interpretation; 3) Definitive & short-form answers: answers are crafted to be unambiguous and definitively correct in a short format, enabling automated evaluation through LLM-as-a-judge frameworks with minimal scoring variance; 4) External-source verified: all annotations undergo rigorous validation against authoritative external references to ensure reliability; 5) Temporal reasoning required: the annotated question types encompass both static single-frame understanding and dynamic temporal reasoning, explicitly evaluating LVLMs' factuality under long-context dependencies. We extensively evaluate 41 state-of-the-art LVLMs and summarize the key findings as follows: 1) Current LVLMs exhibit notable deficiencies in factual adherence, particularly open-source models. The best-performing model, Gemini-1.5-Pro, achieves an F-score of merely 54.4%; 2) Test-time compute paradigms show insignificant performance gains, revealing fundamental constraints on enhancing factuality through post-hoc computation; 3) Retrieval-Augmented Generation demonstrates consistent improvements at the cost of additional inference-time overhead, presenting a critical efficiency-performance trade-off.