Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities by integrating visual and textual inputs, yet modality alignment remains one of the most challenging aspects. Current MLLMs typically rely on simple adapter architectures and pretraining approaches to bridge vision encoders with large language models (LLMs), guided by image-level supervision. We find that this paradigm often leads to suboptimal alignment between modalities, significantly constraining the LLM's ability to properly interpret and reason with visual features. This limitation degrades overall performance, particularly for smaller language models, where capacity constraints are more pronounced and adaptation capabilities are limited. To address this fundamental limitation, we propose Supervised Embedding Alignment (SEA), a token-level supervision alignment method that enables more precise visual-text alignment during pretraining. SEA introduces minimal computational overhead while preserving language capabilities and substantially improving cross-modal understanding. Our comprehensive analyses reveal critical insights into the adapter's role in multimodal integration, and extensive experiments demonstrate that SEA consistently improves performance across various model sizes, with smaller models benefiting the most (average performance gain of 7.61% for Gemma-2B). This work establishes a foundation for developing more effective alignment strategies for future multimodal systems.