VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Abstract

Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
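
The iterative loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify prompts, stopping criteria, or retrieval details, so every name here (`answer_with_agent`, `caption`, `decide`, `retrieve`, `initial_frames`, `max_rounds`) is a hypothetical placeholder, and the toy functions stand in for the large language model and the vision-language tools.

```python
from typing import Callable, Dict, List, Tuple


def answer_with_agent(
    num_frames: int,
    question: str,
    caption: Callable[[int], str],
    decide: Callable[[str, Dict[int, str]], Tuple[str, bool]],
    retrieve: Callable[[str, Dict[int, str], int], List[int]],
    initial_frames: int = 5,   # assumed value, not taken from the paper
    max_rounds: int = 3,       # assumed value, not taken from the paper
) -> Tuple[str, int]:
    """Return (answer, number of frames inspected).

    caption  : hypothetical stand-in for a vision-language model that
               translates one frame into text.
    decide   : hypothetical stand-in for the LLM agent; returns a candidate
               answer and whether it considers the evidence sufficient.
    retrieve : hypothetical stand-in for a retrieval tool that proposes
               additional frame indices to inspect.
    """
    # Start from a small, uniformly spaced sample of frames.
    step = max(1, num_frames // initial_frames)
    indices = list(range(0, num_frames, step))[:initial_frames]
    seen: Dict[int, str] = {i: caption(i) for i in indices}

    answer = ""
    for round_idx in range(max_rounds):
        answer, confident = decide(question, seen)
        # Stop when the agent is satisfied or the round budget is spent.
        if confident or round_idx == max_rounds - 1:
            break
        # Otherwise gather more evidence, captioning only unseen frames.
        for i in retrieve(question, seen, num_frames):
            if i not in seen:
                seen[i] = caption(i)
    return answer, len(seen)


# Toy stand-ins so the sketch runs without any model.
def toy_caption(i: int) -> str:
    return f"description of frame {i}"


def toy_decide(question: str, seen: Dict[int, str]) -> Tuple[str, bool]:
    # Pretend the decisive evidence lives in frame 42.
    return ("B", True) if 42 in seen else ("A", False)


def toy_retrieve(question: str, seen: Dict[int, str], num_frames: int) -> List[int]:
    return [42]


answer, frames_used = answer_with_agent(
    100, "What is the person doing?", toy_caption, toy_decide, toy_retrieve
)
print(answer, frames_used)
```

With these toy stand-ins the loop inspects five initial frames, requests one more, and then stops, which mirrors the abstract's point that only a handful of frames need to be examined when the agent chooses them interactively.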
