Abstract
This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM's final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM's intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM's understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.
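The multi-layer conditioning idea can be illustrated with a minimal sketch. It is not the TBAC-UniImage implementation: the abstract does not say which layers are tapped or how they are routed into the generator, so the even spacing below and the helper names `pick_condition_layers` and `gather_conditions` are assumptions made purely for illustration. Hidden states are represented as a plain sequence with one entry per MLLM layer.

```python
from typing import List, Sequence


def pick_condition_layers(num_mllm_layers: int, num_conditions: int) -> List[int]:
    """Pick layer indices spread across the MLLM depth (0-based).

    Even spacing is an assumption; the abstract only states that the
    layers are "multiple" and "diverse".
    """
    if not 1 <= num_conditions <= num_mllm_layers:
        raise ValueError("need 1 <= num_conditions <= num_mllm_layers")
    if num_conditions == 1:
        # Degenerate case: final-layer-only conditioning, i.e. the
        # "shallow connection" baseline described in the abstract.
        return [num_mllm_layers - 1]
    last = num_mllm_layers - 1
    # Integer arithmetic keeps the first and last layers as endpoints.
    return [i * last // (num_conditions - 1) for i in range(num_conditions)]


def gather_conditions(hidden_states: Sequence, layer_ids: Sequence[int]) -> list:
    """Collect the per-layer hidden states to pass on as generative conditions."""
    return [hidden_states[i] for i in layer_ids]


if __name__ == "__main__":
    # Hypothetical 28-layer MLLM with 4 conditioning taps.
    print(pick_condition_layers(28, 4))  # [0, 9, 18, 27]
```

Under these assumptions, each selected hidden state would serve as a separate condition for the diffusion model, one per rung of the ladder, rather than a single condition taken from the final layer alone.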