Diffusion Autoencoders are Scalable Image Tokenizers

Abstract

Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art image tokenizer which is supervised. DiTo achieves competitive or better quality than state-of-the-art in image reconstruction and downstream image generation tasks.
