Abstract
Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over $920$ man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA