MINERVA: Evaluating Complex Video Reasoning

Abstract

Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess whether models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates, and reasoning traces, will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.
