Abstract
This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. This is currently the largest egocentric QA dataset. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability to recognize and memorize visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel "Memory Pointer Prompting" mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that uses the key visual information to generate responses. This enables the model to comprehend extended video content more effectively. With the data, benchmark, and model, we successfully build MM-Ego, an egocentric multimodal LLM that shows strong performance on egocentric video understanding.