Abstract
The image tokenizer is a critical component in autoregressive (AR) image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explore how to incorporate more visual information into the tokenizer and propose Visual Gaussian Quantization (VGQ), a novel tokenizer paradigm that explicitly enhances structural modeling by integrating 2D Gaussians into traditional visual codebook quantization frameworks. Whereas naive quantization methods struggle to model structured visual information, VGQ encodes image latents as 2D Gaussian distributions, capturing geometric and spatial structures by directly modeling structure-related parameters such as position, rotation, and scale. We further demonstrate that increasing the density of 2D Gaussians within the tokens leads to significant gains in reconstruction fidelity, providing a flexible trade-off between token efficiency and visual richness. On the ImageNet 256x256 benchmark, VGQ achieves strong reconstruction quality with an rFID score of 1.00. With a higher density of 2D Gaussians, it achieves a state-of-the-art reconstruction rFID score of 0.556 and a PSNR of 24.93, substantially outperforming existing methods. Code will be released soon.
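
To make the role of the structure-related parameters concrete, the following minimal sketch shows one common way a 2D Gaussian can be built from a position, a rotation angle, and two scales. The abstract does not specify VGQ's parameterization; the form Sigma = R S S^T R^T used here is borrowed from the Gaussian splatting literature and is an assumption, as are the function names `gaussian_covariance` and `gaussian_value`.

```python
import math

def gaussian_covariance(theta, scale_x, scale_y):
    """Return the 2x2 covariance Sigma = R S S^T R^T as nested tuples.

    theta is the rotation angle in radians; scale_x and scale_y are the
    standard deviations along the Gaussian's own axes (assumed layout).
    """
    c, s = math.cos(theta), math.sin(theta)
    vx, vy = scale_x ** 2, scale_y ** 2
    a = c * c * vx + s * s * vy   # Sigma[0][0]
    b = c * s * (vx - vy)         # Sigma[0][1] == Sigma[1][0]
    d = s * s * vx + c * c * vy   # Sigma[1][1]
    return ((a, b), (b, d))

def gaussian_value(x, y, mean, cov):
    """Unnormalized density exp(-0.5 * d^T Sigma^-1 d) at point (x, y)."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x - mean[0], y - mean[1]
    # Closed-form inverse of a symmetric 2x2 matrix applied to (dx, dy).
    mahalanobis_sq = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * mahalanobis_sq)

# Hypothetical example: a Gaussian stretched along x, then rotated by 90 degrees.
cov_flat = gaussian_covariance(0.0, 2.0, 1.0)
cov_rotated = gaussian_covariance(math.pi / 2, 2.0, 1.0)
```

Under this assumed parameterization, rotating the Gaussian by 90 degrees swaps its long axis from x to y, so `gaussian_value(0.0, 2.0, (0.0, 0.0), cov_rotated)` equals `gaussian_value(2.0, 0.0, (0.0, 0.0), cov_flat)` up to floating-point error. A patch-based code has no such explicit handle on orientation or extent.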