Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Abstract

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporal-informative semantics from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
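To give a sense of scale for 1-FPS sampling over the 3 to 60 minute videos described above, the frame counts can be worked out directly. The sketch below is only illustrative arithmetic: the function names are hypothetical, and `tokens_per_frame` is an assumed parameter whose value is not stated in the abstract.

```python
def frames_at_fps(duration_minutes: float, fps: float = 1.0) -> int:
    """Number of frames sampled from a video of the given duration."""
    return int(duration_minutes * 60 * fps)


def visual_tokens(duration_minutes: float, tokens_per_frame: int, fps: float = 1.0) -> int:
    """Total visual tokens before any compression.

    tokens_per_frame is an assumption supplied by the caller;
    the abstract does not give a value for it.
    """
    return frames_at_fps(duration_minutes, fps) * tokens_per_frame


# Shortest and longest videos in the stated 3 to 60 minute range.
shortest_frames = frames_at_fps(3)
longest_frames = frames_at_fps(60)
print(shortest_frames, longest_frames)
```

At 1 FPS, a 3-minute video yields 180 frames and a 60-minute video yields 3,600 frames, so the full-video context grows by a factor of 20 across the dataset's duration range.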
