TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Abstract

Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Because fine-grained temporal annotations are scarce, existing video benchmarks mostly resemble static image benchmarks and are poorly suited to evaluating temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, and event order. Moreover, it supports evaluation across tasks (video question answering and captioning), video lengths (short and long video understanding), and model types (multimodal video embedding models and text generation models). Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we identify a critical pitfall in multi-choice QA: LLMs can detect the subtle changes in negative captions and use a centralized description as a cue for their prediction. To correct this bias, we propose Multiple Binary Accuracy (MBA). We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and the evaluation code will be made available.
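
The abstract names Multiple Binary Accuracy (MBA) but does not define it. As a rough illustration only, the sketch below assumes that each multi-choice item is split into one binary question per negative caption (true caption versus that negative), and that the item counts as correct only if every one of its binary questions is answered correctly. The function names and the input layout are hypothetical and are not taken from the benchmark's evaluation code.

```python
from typing import Sequence


def multiple_binary_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Assumed MBA: fraction of items whose binary questions are ALL correct.

    results[i][j] is True when, for item i, the model preferred the true
    caption over the j-th negative caption (hypothetical input layout).
    """
    if not results:
        raise ValueError("results must contain at least one item")
    # An item with no binary questions is treated as incorrect.
    solved = sum(1 for item in results if item and all(item))
    return solved / len(results)


def binary_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Plain per-question accuracy over all binary questions, for comparison."""
    flat = [answer for item in results for answer in item]
    if not flat:
        raise ValueError("results must contain at least one binary question")
    return sum(flat) / len(flat)


# Three items with 3, 3, and 2 negative captions respectively.
example = [
    [True, True, True],   # all binary questions right: item counts
    [True, False, True],  # one miss: item does not count
    [True, True],         # all right: item counts
]

mba = multiple_binary_accuracy(example)
per_question = binary_accuracy(example)
```

In this toy input, two of the three items are fully correct while seven of the eight binary questions are correct, so the assumed MBA is lower than the per-question accuracy. Under this assumption, a model cannot score well by exploiting a cue shared across the options of a single multi-choice question.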
