Abstract
We present a generic video super-resolution algorithm based on the Diffusion Posterior Sampling framework with an unconditional video generation model in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model that learns the physics of the real world can handle various kinds of motion patterns as prior knowledge, eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Because of limited computational resources and training data, our experiments are restricted to synthetic data, on which they provide empirical evidence of the algorithm's strong super-resolution capabilities.