Modelling Multimodal Integration in Human Concept Processing with Vision-Language Models

Abstract

Text representations from language models have proven remarkably predictive of human neural activity involved in language processing, with recent transformer-based models outperforming previous architectures in downstream tasks and prediction of brain responses. However, the word representations learnt by language-only models may be limited in that they lack sensory information from other modalities, which several cognitive and neuroscience studies have shown to be reflected in human meaning representations. Here, we leverage current pre-trained vision-language models (VLMs) to investigate whether the integration of visuo-linguistic information they perform leads to representations that are more aligned with human brain activity than those obtained by models trained with language-only input. We focus on fMRI responses recorded while participants read concept words in the context of either a full sentence or a picture. Our results reveal that VLM representations correlate more strongly than those of language-only models with activations in brain areas functionally related to language processing. Additionally, we find that transformer-based vision-language encoders -- e.g., LXMERT and VisualBERT -- yield more brain-aligned representations than generative VLMs, whose autoregressive abilities do not seem to provide an advantage when modelling single words. Finally, our ablation analyses suggest that the high brain alignment achieved by some of the VLMs we evaluate results from semantic information acquired specifically during multimodal pretraining as opposed to being already encoded in their unimodal modules. Altogether, our findings indicate an advantage of multimodal models in predicting human brain activations, which reveals that modelling language and vision integration has the potential to capture the multimodal nature of human concept representations.
