Abstract
Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task's inherent complexity and context window constraints. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multi-modal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content through tool use. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.