VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Abstract

The quality of video-text pairs fundamentally determines the upper bound oftext-to-video models. Currently, the datasets used for training these modelssuffer from significant shortcomings, including low temporal consistency,poor-quality captions, substandard video quality, and imbalanced datadistribution. The prevailing video curation process, which depends on imagemodels for tagging and manual rule-based curation, leads to a highcomputational load and leaves behind unclean data. As a result, there is a lackof appropriate training datasets for text-to-video models. To address thisproblem, we present VidGen-1M, a superior training dataset for text-to-videomodels. Produced through a coarse-to-fine curation strategy, this datasetguarantees high-quality videos and detailed captions with excellent temporalconsistency. When used to train the video generation model, this dataset hasled to experimental results that surpass those obtained with other models.

Quick Read (beta)

loading the full paper ...