Abstract
This paper addresses the task of video question answering (videoQA) via adecomposed multi-stage, modular reasoning framework. Previous modular methodshave shown promise with a single planning stage ungrounded in visual content.However, through a simple and effective baseline, we find that such systems canlead to brittle behavior in practice for challenging videoQA settings. Thus,unlike traditional single-stage planning methods, we propose a multi-stagesystem consisting of an event parser, a grounding stage, and a final reasoningstage in conjunction with an external memory. All stages are training-free, andperformed using few-shot prompting of large models, creating interpretableintermediate outputs at each stage. By decomposing the underlying planning andtask complexity, our method, MoReVQA, improves over prior work on standardvideoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) withstate-of-the-art results, and extensions to related tasks (grounded videoQA,paragraph captioning).