Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

  • 2025-07-23 17:26:05
  • Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu

Abstract

Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long-context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome these limitations, we propose the Deep Video Discovery (DVD) agent, which leverages an agentic search strategy over segmented video clips. Unlike previous video agents that rely on a manually designed, rigid workflow, our approach emphasizes the autonomous nature of agents. Given a set of search-centric tools over a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan based on its current observation state, strategically select tools, formulate appropriate parameters for actions, and iteratively refine its internal reasoning in light of the gathered information. We perform a comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.
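
The observe, plan, act, refine loop described in the abstract can be illustrated with a minimal sketch. This is not the released DVD implementation: the tool names (search_clips, inspect_clip), the toy caption database, and the rule-based stub_planner that stands in for the LLM are all hypothetical, chosen only to show how an agent might pick a tool, fill in its parameters, and iterate on what it has gathered.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Clip:
    start_s: int
    end_s: int
    caption: str


# Toy "video database": captions of fixed-length clips (hypothetical data).
DB: List[Clip] = [
    Clip(0, 10, "a man enters the kitchen"),
    Clip(10, 20, "the man chops a red onion"),
    Clip(20, 30, "a dog sleeps on the sofa"),
]


def search_clips(query: str, top_k: int = 1) -> List[Clip]:
    """Rank clips by word overlap with the query (stand-in for real retrieval)."""
    words = set(query.lower().split())
    scored = [(len(words & set(c.caption.split())), c) for c in DB]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [c for _, c in ranked[:top_k]]


def inspect_clip(start_s: int) -> str:
    """Return the stored caption of the clip starting at start_s, or ''."""
    for c in DB:
        if c.start_s == start_s:
            return c.caption
    return ""


TOOLS: Dict[str, Callable[..., Any]] = {
    "search_clips": search_clips,
    "inspect_clip": inspect_clip,
}

Action = Tuple[str, Dict[str, Any]]
History = List[Tuple[str, Any]]


def stub_planner(question: str, history: History) -> Action:
    """Hypothetical rule-based planner; a real agent would query an LLM here."""
    if not history:
        return ("search_clips", {"query": question})
    last_tool, last_result = history[-1]
    if last_tool == "search_clips" and last_result:
        return ("inspect_clip", {"start_s": last_result[0].start_s})
    if last_tool == "inspect_clip":
        return ("answer", {"text": last_result})
    return ("answer", {"text": "not found"})


def run_agent(question: str,
              planner: Callable[[str, History], Action] = stub_planner,
              max_steps: int = 5) -> str:
    """Loop: plan on the current observations, call a tool, record the result."""
    history: History = []
    for _ in range(max_steps):
        tool, args = planner(question, history)
        if tool == "answer":
            return args["text"]
        history.append((tool, TOOLS[tool](**args)))
    return "step budget exhausted"
```

With this toy database, run_agent("who chops the onion") first searches, then inspects the best-matching clip, and returns "the man chops a red onion"; a query with no overlapping words ends with "not found". The point of the design is that the sequence of tool calls is decided at run time by the planner rather than fixed in advance.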

 
