Abstract
We present ILLUME+, which leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to handle three fundamental capabilities simultaneously: understanding, generation, and editing. Models like Chameleon and EMU3 utilize VQGAN for image discretization; due to the lack of deep semantic interaction, they lag behind specialist models like LLaVA in visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing due to poor texture preservation. Meanwhile, the Janus series decouples the input and output image representations, limiting its ability to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project Page: https://illume-unified-mllm.github.io/.