TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

Abstract

Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
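
The two-stage design described in the abstract (a discrete VQ tokenizer followed by a token encoder that yields continuous semantic features) can be illustrated with a toy sketch. This is not the TokLIP implementation: the function names `quantize` and `encode_tokens`, the sizes, and the random weights are all hypothetical, and the mean-pool-plus-projection step is only a stand-in for the ViT-based token encoder, which in the paper is a trained Transformer.

```python
import math
import random


def quantize(patch_features, codebook):
    """Nearest-neighbour VQ: map each patch feature to the index of its closest code."""
    ids = []
    for feat in patch_features:
        dists = [sum((f - c) ** 2 for f, c in zip(feat, code)) for code in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def encode_tokens(token_ids, embedding, projection):
    """Toy stand-in for a token encoder (assumption, not a ViT):
    embed the discrete ids, mean-pool, project, and L2-normalize."""
    dim = len(embedding[0])
    pooled = [sum(embedding[i][d] for i in token_ids) / len(token_ids) for d in range(dim)]
    out = [sum(p * w for p, w in zip(pooled, row)) for row in projection]
    norm = math.sqrt(sum(x * x for x in out)) or 1.0
    return [x / norm for x in out]


# Hypothetical sizes: 8 codes, 4-dim patch features, 5-dim embeddings, 3-dim output.
rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
embedding = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(8)]
projection = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(3)]
patches = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]

token_ids = quantize(patches, codebook)                     # discrete tokens (generation side)
semantic = encode_tokens(token_ids, embedding, projection)  # continuous feature (comprehension side)
print(token_ids)
print(semantic)
```

In this sketch the discrete `token_ids` are what an autoregressive model would consume or predict, while the normalized `semantic` vector is the kind of continuous feature that could be aligned with text embeddings in a CLIP-style objective. Keeping the two outputs separate mirrors the abstract's point that the quantizer itself needs no tailored changes.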
