Abstract
Visual tokenizers are a critical component of visual generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce WeTok, a powerful and concise tokenizer that surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups and perform lookup-free quantization for each group. As a result, GQ efficiently overcomes the memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Unlike prior tokenizers, we introduce a generative decoder with a prior on an extra noise variable. GD can thus probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show the superior performance of WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest-compression model achieves a zero-shot rFID of 3.49 at a compression ratio of 768, outperforming Cosmos (compression ratio 384, rFID 4.57), whose compression ratio is only half of ours. Code and models are available at https://github.com/zhuangshaobin/WeTok.
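To make the grouping idea behind GQ concrete, the following is a minimal NumPy sketch, not the WeTok implementation. It assumes the common sign-based form of lookup-free quantization, in which each latent channel is binarized to -1 or +1 and the token index is read off as a binary number. The function name `group_lfq`, the `(N, C)` feature layout, and the bit ordering are illustrative assumptions.

```python
import numpy as np


def group_lfq(z: np.ndarray, num_groups: int):
    """Sketch of group-wise lookup-free quantization (assumed sign-based).

    z: array of shape (N, C), one C-dimensional latent per position.
    Returns:
      codes:   (N, C) array with entries in {-1.0, +1.0}
      indices: (N, num_groups) integer token index per group
    """
    n, c = z.shape
    if c % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    d = c // num_groups  # channels per group

    # Partition the channels into contiguous groups.
    groups = z.reshape(n, num_groups, d)

    # Lookup-free step: binarize each channel by its sign (0 maps to -1 here).
    bits = (groups > 0).astype(np.int64)
    codes = (2 * bits - 1).astype(np.float64)

    # Each group's index is the integer whose binary digits are its bits
    # (least-significant bit first), so no codebook lookup is needed.
    weights = np.left_shift(1, np.arange(d, dtype=np.int64))
    indices = (bits * weights).sum(axis=-1)

    return codes.reshape(n, c), indices


if __name__ == "__main__":
    z = np.array([[0.3, -1.2, 0.7, 0.1],
                  [-0.5, -0.1, 2.0, -3.0]])
    codes, indices = group_lfq(z, num_groups=2)
    print(codes)
    print(indices)
```

In this toy setting, 4 channels split into 2 groups give each group an implicit codebook of 2^2 = 4 entries, whereas treating all 4 channels jointly would give 2^4 = 16. Any quantity computed over all codes of a codebook therefore scales with the per-group size rather than the joint size.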