ContentV: Efficient Training of Video Generation Models with Limited Compute

Abstract

Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io.
