Abstract
Video super-resolution (VSR) remains a major challenge in low-level vision. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, which makes processing long sequences in VSR difficult. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel \textbf{V}ideo \textbf{S}uper-\textbf{R}esolution framework that leverages the power of \textbf{M}amba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enlarge receptive fields efficiently. To better align adjacent frames, we propose a Deformable Cross-Mamba Alignment module. This module uses a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency-domain gap between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. In extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.
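To make the idea of a Charbonnier-like penalty in the frequency domain concrete, the following is a minimal NumPy sketch and not the exact loss used in VSRM. It assumes the loss applies the standard Charbonnier form $\sqrt{|d|^2 + \epsilon^2}$ to the difference $d$ between the 2-D Fourier spectra of the reconstructed and ground-truth frames. The function name `frequency_charbonnier_loss`, the orthonormal FFT scaling, and the default `eps` are illustrative assumptions.

```python
import numpy as np


def frequency_charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier penalty on the difference of 2-D Fourier spectra.

    pred, target: arrays of shape (..., H, W), e.g. (T, C, H, W) frames.
    eps: smoothing constant (assumed value, not taken from the paper).
    """
    # 2-D FFT over the last two (spatial) axes; "ortho" keeps the
    # spectrum on a scale comparable to the pixel domain.
    pred_f = np.fft.fft2(pred, norm="ortho")
    target_f = np.fft.fft2(target, norm="ortho")

    # Complex spectral difference; np.abs gives its magnitude.
    diff = pred_f - target_f

    # Charbonnier: a smooth, differentiable approximation of the L1 norm.
    return float(np.mean(np.sqrt(np.abs(diff) ** 2 + eps ** 2)))
```

With identical inputs the spectral difference is zero everywhere, so the loss reduces to `eps`. Any mismatch between the spectra, including one confined to high frequencies, raises it above that floor.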