An Image is Worth 32 Tokens for Reconstruction and Generation

Abstract

Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce the Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves performance competitive with state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains 1.97 gFID, significantly outperforming the MaskGIT baseline by 4.21 on the ImageNet 256 x 256 benchmark. The advantages of TiTok become even more significant at higher resolution. On the ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the number of image tokens by 64x, leading to a 410x faster generation process. Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
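
The token counts quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes the 256 and 1024 figures come from 2D latent grids with downsampling factors of 16 and 8, which the abstract does not state explicitly, and the helper name grid_token_count is hypothetical rather than part of any TiTok code.

```python
def grid_token_count(height: int, width: int, downsample: int) -> int:
    """Number of tokens in a 2D latent grid with a fixed downsampling factor."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


# Assumed downsampling factors (16 and 8) for the 2D-grid tokenizers.
tokens_f16 = grid_token_count(256, 256, 16)  # 16 x 16 grid
tokens_f8 = grid_token_count(256, 256, 8)    # 32 x 32 grid

# 1D sequence length quoted in the abstract for a 256 x 256 image.
titok_tokens = 32

print(tokens_f16, tokens_f8)  # 256 1024
print(tokens_f16 // titok_tokens, tokens_f8 // titok_tokens)  # 8 32
```

Under these assumptions, a 32-token sequence is 8x to 32x shorter than the corresponding 2D grids at 256 x 256 resolution.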
