Rethinking Video Tokenization: A Conditioned Diffusion-based Approach

Abstract

Existing video tokenizers typically use the traditional Variational Autoencoder (VAE) architecture for video compression and reconstruction. However, to achieve good performance, their training often relies on complex multi-stage training tricks that go beyond the basic reconstruction loss and KL regularization. Among these tricks, the most challenging is the precise tuning of adversarial training with additional Generative Adversarial Networks (GANs) in the final stage, which can hinder stable convergence. In contrast to GANs, diffusion models offer more stable training processes and can generate higher-quality results. Inspired by these advantages, we propose CDT, a novel Conditioned Diffusion-based video Tokenizer that replaces the GAN-based decoder with a conditional causal diffusion model. The encoder compresses spatio-temporal information into compact latents, while the decoder reconstructs videos through a reverse diffusion process conditioned on these latents. During inference, we incorporate a feature cache mechanism to generate videos of arbitrary length while maintaining temporal continuity, and adopt a sampling acceleration technique to enhance efficiency. CDT is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Extensive experiments demonstrate that CDT achieves state-of-the-art performance in video reconstruction tasks with just single-step sampling. Even a scaled-down version of CDT (3$\times$ inference speedup) still performs comparably with top baselines. Moreover, the latent video generation model trained with CDT also exhibits superior performance. The source code and pretrained weights are available at https://github.com/ali-vilab/CDT.
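
To make the training objective described above more concrete, the following is a minimal sketch of how an MSE diffusion loss and a KL term might be combined for a latent-conditioned decoder. It is an illustration under stated assumptions, not the authors' implementation: the noise-prediction (epsilon) parameterization, the function names (kl_to_standard_normal, add_noise, tokenizer_loss), the predict_noise callable, and the kl_weight parameter are all hypothetical, and the LPIPS term is omitted because it requires a pretrained perceptual network.

```python
import numpy as np


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)), averaged over the batch."""
    per_sample = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)
    return float(np.mean(per_sample))


def add_noise(x0, noise, alpha_bar_t):
    """Forward diffusion step: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise


def tokenizer_loss(x0, mu, logvar, predict_noise, alpha_bar_t, kl_weight, rng):
    """Assumed objective: diffusion MSE plus a weighted KL term.

    predict_noise(x_t, alpha_bar_t, z) is a hypothetical stand-in for the
    conditional diffusion decoder; mu and logvar stand in for encoder outputs.
    """
    # Reparameterized sample from the encoder's Gaussian posterior.
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    # Corrupt the clean input with Gaussian noise at the chosen noise level.
    noise = rng.standard_normal(x0.shape)
    x_t = add_noise(x0, noise, alpha_bar_t)
    # The decoder is asked to recover the injected noise, given the latent z.
    pred = predict_noise(x_t, alpha_bar_t, z)
    mse = float(np.mean((pred - noise) ** 2))
    kl = kl_to_standard_normal(mu, logvar)
    return mse + kl_weight * kl, mse, kl
```

In this sketch the latent z enters only through the decoder call, which mirrors the abstract's description of reconstruction as a reverse diffusion process conditioned on the encoder's latents. A real decoder would be a causal video network, and the actual loss weights and noise schedule are not given in the abstract.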
