Abstract
Diffusion models have demonstrated exceptional capabilities in image generation and restoration, yet their application to video super-resolution faces significant challenges in maintaining both high fidelity and temporal consistency. We present DiffVSR, a diffusion-based framework for real-world video super-resolution that effectively addresses these challenges through key innovations. For intra-sequence coherence, we develop a multi-scale temporal attention module and temporal-enhanced VAE decoder that capture fine-grained motion details. To ensure inter-sequence stability, we introduce a noise rescheduling mechanism with an interweaved latent transition approach, which enhances temporal consistency without additional training overhead. We propose a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization despite limited high-quality video data. Extensive experiments demonstrate that DiffVSR delivers superior results in both visual quality and temporal consistency, setting a new performance standard in real-world video super-resolution.