Abstract
Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflict in the learning dynamics of the distillation process: the optimization gradients and loss contributions differ significantly across timesteps. This discrepancy prevents the distilled student model from reaching an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient \textbf{Dual-Expert Consistency Model~(DCM)}, where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce a Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Losses to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at \href{https://github.com/Vchitect/DCM}{https://github.com/Vchitect/DCM}.