Abstract
Effective multimodal reasoning depends on the alignment of visual and linguistic representations, yet the mechanisms by which vision-language models (VLMs) achieve this alignment remain poorly understood. We introduce a methodological framework that deliberately keeps both the large language model (LLM) and the vision transformer (ViT) frozen, connecting them solely through a linear adapter trained during visual instruction tuning. This design is fundamental to our approach: by keeping the language model frozen, we ensure it maintains its original language representations without adaptation to visual data. Consequently, the linear adapter must map visual features directly into the LLM's existing representational space rather than allowing the language model to develop specialized visual understanding through fine-tuning. Our experimental design uniquely enables the use of pre-trained sparse autoencoders (SAEs) of the LLM as analytical probes. These SAEs remain perfectly aligned with the unchanged language model and serve as a snapshot of the learned language feature representations. Through systematic analysis of SAE reconstruction error, sparsity patterns, and SAE feature descriptions, we reveal the layer-wise progression through which visual representations gradually align with language feature representations, converging in middle-to-later layers. This suggests a fundamental misalignment between ViT outputs and early LLM layers, raising important questions about whether current adapter-based architectures optimally facilitate cross-modal representation learning.
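As a rough illustration of the probing setup described above, the following NumPy sketch projects ViT features through a linear adapter and computes two of the diagnostics named in the abstract: SAE reconstruction error and sparsity. Everything in it is an assumption made for illustration: the function names, the ReLU encoder/linear decoder form of the SAE, the normalized-MSE error, the L0 sparsity measure, and the random toy weights. None of it is taken from the paper's implementation. In the actual study the SAE would be applied to LLM activations at each layer; here the adapter output stands in for one such layer.

```python
import numpy as np


def linear_adapter(vit_features, W, b):
    """Map ViT features (n_tokens, d_vit) into the LLM space (n_tokens, d_llm)."""
    return vit_features @ W + b


def sae_metrics(h, W_enc, b_enc, W_dec, b_dec):
    """Return (normalized reconstruction error, mean L0 sparsity) for activations h.

    Assumes a plain ReLU sparse autoencoder; real pre-trained SAEs may differ.
    """
    f = np.maximum(h @ W_enc + b_enc, 0.0)   # sparse feature activations
    h_hat = f @ W_dec + b_dec                # reconstruction of the activations
    # Squared error relative to the total variance of the activations.
    err = np.sum((h - h_hat) ** 2) / np.sum((h - h.mean(axis=0)) ** 2)
    # Average number of active SAE features per token.
    l0 = np.mean(np.count_nonzero(f, axis=1))
    return float(err), float(l0)


# Toy example with random weights (hypothetical sizes, not from the paper).
rng = np.random.default_rng(0)
n_tokens, d_vit, d_llm, d_sae = 5, 8, 16, 64
vit_features = rng.normal(size=(n_tokens, d_vit))
W, b = rng.normal(size=(d_vit, d_llm)), np.zeros(d_llm)
visual_tokens = linear_adapter(vit_features, W, b)

W_enc, b_enc = rng.normal(size=(d_llm, d_sae)), np.zeros(d_sae)
W_dec, b_dec = rng.normal(size=(d_sae, d_llm)), np.zeros(d_llm)
err, l0 = sae_metrics(visual_tokens, W_enc, b_enc, W_dec, b_dec)
print(f"normalized reconstruction error: {err:.3f}, mean L0: {l0:.1f}")
```

Under these assumptions, repeating the measurement on activations from successive LLM layers and comparing visual tokens against text tokens would give a layer-wise profile of the kind the abstract describes.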