FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution

Abstract

Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.
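
To make the idea behind innovation (ii) concrete, the following is a minimal sketch of a generic locality-constrained attention mask. It is not the FlashVSR implementation, which the abstract does not specify. The function names `local_attention_mask` and `density`, the square window, and the `radius` parameter are all hypothetical choices made for illustration. The sketch shows only the general principle: if each query token may attend only to keys within a fixed spatial window, most query-key pairs are skipped, and the number of keys per query stays bounded as the token grid grows.

```python
def local_attention_mask(height, width, radius):
    """Build a boolean mask over a height x width token grid.

    mask[q][k] is True when key token k lies within `radius` rows and
    `radius` columns of query token q (a square window). This window
    shape is an illustrative assumption, not the paper's design.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    mask = []
    for qr, qc in coords:
        row = [
            abs(qr - kr) <= radius and abs(qc - kc) <= radius
            for kr, kc in coords
        ]
        mask.append(row)
    return mask


def density(mask):
    """Fraction of query-key pairs that the mask keeps."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total


if __name__ == "__main__":
    small = local_attention_mask(4, 4, radius=1)
    large = local_attention_mask(8, 8, radius=1)
    # An interior token sees the same 3x3 neighbourhood on both grids,
    # while the kept fraction of all pairs shrinks as the grid grows.
    print(sum(small[5]), sum(large[9]))
    print(density(small), density(large))
```

Under these assumptions, an interior query attends to the same number of keys on the 4x4 and the 8x8 grid, which is one plausible reading of how a locality constraint could keep inference at high resolution close to the attention range seen during training.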
