DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Abstract

The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, leading to degraded performance in both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high-level and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.
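
The idea of separate codebooks for low-level and high-level features can be illustrated with a toy nearest-neighbor quantizer. This is only a minimal sketch under stated assumptions, not the DualToken implementation: the abstract does not specify how features are extracted, how the codebooks are trained, or their sizes, so the function names (`nearest_code`, `dual_quantize`), the squared-L2 lookup, and the toy vectors below are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def dual_quantize(
    low_feats: Sequence[Vector],
    high_feats: Sequence[Vector],
    pixel_codebook: Sequence[Vector],
    semantic_codebook: Sequence[Vector],
) -> Tuple[List[int], List[int]]:
    """Quantize low-level and high-level features against separate codebooks.

    Assumption: both feature sets are already computed elsewhere; each one
    is looked up only in its own codebook, so neither codebook has to serve
    both the perceptual and the semantic role.
    """
    pixel_ids = [nearest_code(f, pixel_codebook) for f in low_feats]
    semantic_ids = [nearest_code(f, semantic_codebook) for f in high_feats]
    return pixel_ids, semantic_ids


# Toy, made-up values purely for illustration.
pixel_codebook = [[0.0, 0.0], [1.0, 1.0]]
semantic_codebook = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
low_feats = [[0.1, -0.1], [0.9, 1.2]]
high_feats = [[0.9, 0.1], [0.4, 0.6]]

pixel_ids, semantic_ids = dual_quantize(
    low_feats, high_feats, pixel_codebook, semantic_codebook
)
print(pixel_ids, semantic_ids)
```

Each image position then carries two discrete IDs, one per codebook. How a downstream MLLM consumes the two token streams is left open in this sketch.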
