CinePile: A Long Video Question Answering Dataset and Benchmark

Abstract

Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance in these tasks, highlighting the complexity and challenge inherent in video understanding. The dataset is available at https://hf.co/datasets/tomg-group-umd/cinepile
