Abstract
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task requires not only answering visual questions but also localizing multiple relevant time intervals within the video as visual evidence. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling the construction of a large-scale dataset for instruction tuning. To monitor progress on this new task, we further curate a high-quality benchmark, MultiHop-EgoQA, with careful manual verification and refinement. Experimental results reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models (MLLMs) by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a new baseline for this challenging task. Furthermore, when trained on third-person view videos, the same architecture also achieves state-of-the-art performance on the single-hop VidQA benchmark, ActivityNet-RTL, demonstrating its effectiveness.