Abstract
The development of video diffusion models unveils a significant challenge: substantial computational demands. To mitigate this challenge, we note that the reverse process of diffusion exhibits an inherent entropy-reducing nature. Given the inter-frame redundancy in the video modality, maintaining the full frame rate in high-entropy stages is unnecessary. Based on this insight, we propose TPDiff, a unified framework to enhance training and inference efficiency. By dividing diffusion into several stages, our framework progressively increases the frame rate along the diffusion process, with only the last stage operating at the full frame rate, thereby optimizing computational efficiency. To train the multi-stage diffusion model, we introduce a dedicated training framework: stage-wise diffusion. By solving the partitioned probability flow ordinary differential equations (ODEs) of diffusion under aligned data and noise, our training strategy is applicable to various diffusion forms and further enhances training efficiency. Comprehensive experimental evaluations validate the generality of our method, demonstrating a 50% reduction in training cost and a 1.5x improvement in inference efficiency.