Abstract
We present Visual Lexicon (ViLex), a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to be used independently as "text tokens" or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction than text embeddings, even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning T2I models. Additionally, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.