Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Abstract

Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work explores scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation, and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but yields mixed benefits for generation. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
