Abstract
Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which are complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latents across the entire sequence. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods on both synthetic and real-world benchmarks, as well as on AI-generated videos, showcasing impressive visual realism and temporal consistency.