Abstract
Recent advances in auto-regressive multimodal large language models (MLLMs) have yielded promising progress on vision-language tasks. While a variety of studies have investigated how large language models process linguistic information, little is currently known about the inner workings of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between the two modalities, language and vision, in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with several models from the LLaVA series, we find that there are two distinct stages in the integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of the (linguistic) question tokens. In the middle layers, it once again transfers visual information, this time about specific objects relevant to the question, to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in MLLMs, thereby facilitating future research into multimodal information localization and editing.