Factorized Visual Tokenization and Generation

Abstract

Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy, promoting diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models like CLIP and DINO to infuse semantic richness into the learned representations. This design ensures our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted into auto-regressive image generation. https://showlab.github.io/FQGAN
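
To make the idea of factorizing a codebook concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the FQGAN implementation: the abstract does not say how each sub-codebook receives its input, so the sketch simply splits a feature vector into equal contiguous chunks and assigns one chunk to each sub-codebook. The names `nearest_code` and `factorized_quantize`, the toy codebooks, and the squared-L2 nearest-neighbor rule are all hypothetical choices made for this example.

```python
from math import prod


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def factorized_quantize(feature, sub_codebooks):
    """Quantize one feature vector with several independent sub-codebooks.

    Assumption (not from the paper): the feature is split into equal
    contiguous chunks, and chunk j is quantized only by sub-codebook j.
    """
    k = len(sub_codebooks)
    chunk = len(feature) // k
    indices, quantized = [], []
    for j, codebook in enumerate(sub_codebooks):
        part = feature[j * chunk:(j + 1) * chunk]
        idx = nearest_code(part, codebook)
        indices.append(idx)              # one discrete token per sub-codebook
        quantized.extend(codebook[idx])  # concatenate the selected entries
    return indices, quantized


# Toy sub-codebooks with 2-dimensional entries (hypothetical values).
sub_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],              # sub-codebook 0: 2 entries
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],  # sub-codebook 1: 3 entries
]
feature = [0.9, 0.8, 0.1, 0.9]

indices, quantized = factorized_quantize(feature, sub_codebooks)

# Distinct index combinations grow as a product of the sub-codebook sizes,
# while the number of entries compared per feature grows only as their sum.
effective_vocab = prod(len(cb) for cb in sub_codebooks)
lookups = sum(len(cb) for cb in sub_codebooks)
```

In this toy setting the two sub-codebooks can express 2 x 3 = 6 index combinations while comparing against only 2 + 3 = 5 entries. The gap widens quickly with size: under the same assumption, two sub-codebooks of 1,024 entries each would cover over a million combinations with 2,048 comparisons. The disentanglement regularization and the CLIP/DINO representation learning described above are training-time objectives and are not modeled here.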
