LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

Abstract

The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality, including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large amount of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.
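
The selection step described in the abstract can be pictured as a threshold filter over per-video statistics. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the `VideoStats` record, the `keep_video` and `filter_videos` helpers, the score ranges, and the two 0.5 thresholds are hypothetical placeholders. Only the 10-second minimum and the no-cuts requirement come from the abstract.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class VideoStats:
    """Hypothetical per-video record; field names are illustrative only."""
    video_id: str
    duration_s: float      # clip length in seconds
    num_scene_cuts: int    # cuts found by some scene-cut detector
    dynamic_degree: float  # assumed motion score in [0, 1]
    semantic_score: float  # assumed semantic-quality score in [0, 1]


MIN_DURATION_S = 10.0     # from the abstract: videos cover at least 10 seconds
MAX_SCENE_CUTS = 0        # from the abstract: long takes without cuts
MIN_DYNAMIC_DEGREE = 0.5  # assumed placeholder, not a value from the paper
MIN_SEMANTIC_SCORE = 0.5  # assumed placeholder, not a value from the paper


def keep_video(v: VideoStats) -> bool:
    """Return True if the video passes all four illustrative checks."""
    return (
        v.duration_s >= MIN_DURATION_S
        and v.num_scene_cuts <= MAX_SCENE_CUTS
        and v.dynamic_degree >= MIN_DYNAMIC_DEGREE
        and v.semantic_score >= MIN_SEMANTIC_SCORE
    )


def filter_videos(videos: Iterable[VideoStats]) -> List[VideoStats]:
    """Keep only the videos that pass every check, preserving input order."""
    return [v for v in videos if keep_video(v)]
```

In practice each statistic would come from a separate model or detector, and the thresholds would be tuned on the source data; the sketch only shows how the criteria combine into a single pass/fail decision.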
