Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Abstract

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder this alignment: 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA.
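
To make the two issues concrete, the following is a minimal sketch of temperature scaling applied to the cross-modal part of a joint attention row. It is an illustration under stated assumptions, not the released implementation: the function names, the values gamma = 1.5 and t_threshold = 0.5, the step-shaped schedule, and the convention that t near 1 means a noisy early step are all hypothetical and do not come from the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_query_attention(text_logits, visual_logits, t,
                           gamma=1.5, t_threshold=0.5):
    """Attention weights of one visual query over text and visual keys.

    gamma, t_threshold and the step schedule are hypothetical
    illustration choices, not values taken from the abstract.
    """
    # Assumption: t in [0, 1], with t close to 1 meaning an early, noisy step.
    # The visual-to-text logits are scaled only at those early steps.
    scale = gamma if t >= t_threshold else 1.0
    scaled_text = [scale * x for x in text_logits]
    # Joint softmax over the concatenated text and visual keys.
    weights = softmax(scaled_text + list(visual_logits))
    n_text = len(text_logits)
    return weights[:n_text], weights[n_text:]


def text_mass(text_logits, visual_logits, t, **kwargs):
    """Total attention mass that the visual query places on text tokens."""
    text_w, _ = visual_query_attention(text_logits, visual_logits, t, **kwargs)
    return sum(text_w)


if __name__ == "__main__":
    # Token imbalance: 4 text keys compete with 64 visual keys.
    text = [2.0] * 4
    visual = [1.0] * 64
    print("early step text mass:", text_mass(text, visual, t=0.9))
    print("late step text mass: ", text_mass(text, visual, t=0.1))
```

In this toy setting the text logits are positive, so scaling them raises the share of attention that reaches the few text tokens at early steps and leaves late steps unchanged. With negative logits a multiplicative scale would have the opposite effect, which is one reason the sketch should not be read as the authors' exact formulation.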
