Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths

Abstract

AI-generated content has attracted considerable attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models (DMs) are another class of deep generative models and have recently achieved remarkable performance on various image synthesis tasks. However, training image diffusion models usually requires substantial computational resources to achieve high performance, which makes extending diffusion models to high-dimensional video synthesis tasks even more computationally expensive. To ease this problem while leveraging their advantages, we introduce lightweight video diffusion models that synthesize high-fidelity and arbitrarily long videos from pure noise. Specifically, we propose to perform diffusion and denoising in a low-dimensional 3D latent space, which significantly outperforms previous methods that operate in 3D pixel space under a limited computational budget. In addition, though trained on tens of frames, our models can generate videos of arbitrary length, e.g., thousands of frames, in an autoregressive way. Finally, conditional latent perturbation is further introduced to reduce performance degradation when synthesizing long-duration videos. Extensive experiments on various datasets and generated lengths suggest that our framework is able to sample much more realistic and longer videos than previous approaches, including GAN-based, autoregressive-based, and diffusion-based methods.
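
The autoregressive extension and conditional latent perturbation described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the per-frame latent shape, the chunk and context lengths, the perturbation scale, and especially `sample_chunk`, which is a hypothetical placeholder standing in for a trained latent diffusion sampler. The sketch only shows the control flow: generate a first chunk from pure noise, then repeatedly condition on the last few latent frames, slightly perturbed with Gaussian noise, to produce the next chunk.

```python
import numpy as np

# Assumed per-frame latent shape (channels, height, width); illustrative only.
LATENT_SHAPE = (4, 8, 8)


def sample_chunk(n_frames, context, rng, steps=10):
    """Hypothetical placeholder sampler, NOT a trained diffusion model.

    Starts from pure Gaussian noise and repeatedly shrinks it toward the
    mean of the conditioning frames (or toward zero when unconditional).
    """
    x = rng.standard_normal((n_frames, *LATENT_SHAPE))
    target = 0.0 if context is None else context.mean(axis=0, keepdims=True)
    for _ in range(steps):
        x = 0.8 * x + 0.2 * target
    return x


def generate_long_latents(total_frames, chunk_len=16, context_len=4,
                          perturb_std=0.1, seed=0):
    """Autoregressively extend a latent video to `total_frames` frames."""
    if not 0 < context_len < chunk_len:
        raise ValueError("context_len must be in (0, chunk_len)")
    rng = np.random.default_rng(seed)

    # First chunk: unconditional generation from pure noise.
    frames = sample_chunk(chunk_len, None, rng)

    while frames.shape[0] < total_frames:
        # Condition on the most recent latent frames.
        context = frames[-context_len:]
        # Assumed form of conditional latent perturbation: add small
        # Gaussian noise to the conditioning latents before sampling.
        noisy_context = context + perturb_std * rng.standard_normal(context.shape)
        new = sample_chunk(chunk_len - context_len, noisy_context, rng)
        frames = np.concatenate([frames, new], axis=0)

    return frames[:total_frames]


latents = generate_long_latents(total_frames=50)
print(latents.shape)
```

In a real system the returned latents would still have to be mapped back to pixel space by a decoder, and the placeholder sampler would be replaced by the trained denoising model; neither is modeled here.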
