Abstract
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/