Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

Abstract

Video tokenizers are essential for latent video diffusion models, convertingraw video data into spatiotemporally compressed latent spaces for efficienttraining. However, extending state-of-the-art video tokenizers to achieve atemporal compression ratio beyond 4x without increasing channel capacity posessignificant challenges. In this work, we propose an alternative approach toenhance temporal compression. We find that the reconstruction quality oftemporally subsampled videos from a low-compression encoder surpasses that ofhigh-compression encoders applied to original videos. This indicates thathigh-compression models can leverage representations from lower-compressionmodels. Building on this insight, we develop a bootstrappedhigh-temporal-compression model that progressively trains high-compressionblocks atop well-trained lower-compression models. Our method includes across-level feature-mixing module to retain information from the pretrainedlow-compression model and guide higher-compression blocks to capture theremaining details from the full video sequence. Evaluation of video benchmarksshows that our method significantly improves reconstruction quality whileincreasing temporal compression compared to direct extensions of existing videotokenizers. Furthermore, the resulting compact latent space effectively trainsa video diffusion model for high-quality video generation with a reduced tokenbudget.

Quick Read (beta)

loading the full paper ...