Abstract
We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.