QVGen: Pushing the Limit of Quantized Video Generative Models

Abstract

Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has shown notable success in reducing cost for image DMs, while its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of $\Phi$, we propose a rank-decay strategy that progressively eliminates $\Phi$. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization $\mathbf{\gamma}$ to identify and decay low-contributing components. This strategy retains performance while zeroing out inference overhead. Extensive experiments across $4$ state-of-the-art (SOTA) video DMs, with parameter sizes ranging from $1.3$B to $14$B, show that QVGen is the first to reach quality comparable to full precision under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of $+25.28$ in Dynamic Degree and $+8.43$ in Scene Consistency on VBench.
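The rank-decay idea can be illustrated with a small NumPy sketch. This is not the paper's implementation: it assumes that $\Phi$ is an additive matrix initialised to the quantization error of a single weight matrix, that $\mathbf{\gamma}$ acts as a scalar in $[0, 1]$ scaling the smallest singular components, and that the rank is halved at each stage. The names `fake_quantize`, `decay_step`, and `run_rank_decay` are hypothetical, and the training steps that would run between decay steps are omitted.

```python
import numpy as np


def fake_quantize(w, bits=4):
    """Uniform symmetric per-tensor quantizer (an assumed stand-in)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def decay_step(phi, keep, gamma):
    """SVD phi and scale every component beyond the top `keep` by gamma."""
    u, s, vt = np.linalg.svd(phi, full_matrices=False)
    weights = np.ones_like(s)
    weights[keep:] = gamma  # gamma = 0 removes those components entirely
    return (u * (s * weights)) @ vt


def run_rank_decay(phi, start_rank, gammas=(0.5, 0.0)):
    """Hypothetical schedule: halve the rank of phi until it reaches zero."""
    rank = start_rank
    phi = decay_step(phi, keep=rank, gamma=0.0)  # assumed initial truncation
    ranks, norms = [rank], [np.linalg.norm(phi)]
    while rank > 0:
        rank //= 2
        for gamma in gammas:
            # In real QAT, training steps would run here with `effective`
            # in the forward pass, so the quantized weights can absorb
            # what the decaying components used to contribute.
            effective = decay_step(phi, keep=rank, gamma=gamma)
            norms.append(np.linalg.norm(effective))
        phi = effective  # commit once gamma has reached zero
        ranks.append(rank)
    return phi, ranks, norms


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
phi_init = w - fake_quantize(w, bits=4)  # auxiliary term = quantization error
phi_final, ranks, norms = run_rank_decay(phi_init, start_rank=32)
print(ranks)  # [32, 16, 8, 4, 2, 1, 0]
```

Under these assumptions, `phi_final` is the zero matrix once the schedule completes, so only the quantized weights would be needed at inference, which mirrors the abstract's claim of zero inference overhead.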
