Abstract
Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 × 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io.