Abstract
The advanced language processing abilities of large language models (LLMs) have stimulated debate over their capacity to replicate human-like cognitive processes. One factor differentiating language processing in humans from that in LLMs is that, for humans, language input is often grounded in more than one perceptual modality, whereas most LLMs process solely text-based information. Multimodal grounding allows humans to integrate, for example, visual context with linguistic information and thereby place constraints on the space of upcoming words, reducing cognitive load and improving comprehension. Recent multimodal LLMs (mLLMs) combine a visual-linguistic embedding space with a transformer-type attention mechanism for next-word prediction. Here we ask whether predictive language processing based on multimodal input in mLLMs aligns with that of humans. Two hundred participants watched short audio-visual clips and estimated the predictability of an upcoming verb or noun. The same clips were processed by the mLLM CLIP, with predictability scores based on comparing image and text feature vectors. Eye-tracking was used to estimate which visual features participants attended to, and CLIP's visual attention weights were recorded. We find that alignment of predictability scores was driven by the multimodality of CLIP (no alignment for a state-of-the-art unimodal LLM) and by the attention mechanism (no alignment when attention weights were perturbed or when the same input was fed to a multimodal model without attention). We further find a significant spatial overlap between CLIP's visual attention weights and human eye-tracking data. These results suggest that comparable processes of integrating multimodal information, guided by attention to relevant visual features, support predictive language processing in mLLMs and humans.
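The abstract states that predictability scores were based on comparing image and text feature vectors, but it does not say which comparison was used. The sketch below assumes cosine similarity, a common choice for CLIP-style embeddings. The function names, the candidate words, and the three-dimensional toy vectors are all hypothetical and serve only as illustration; they are not the study's pipeline or data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predictability_scores(image_vec, text_vecs):
    """Map each candidate word to its similarity with the image vector.

    Assumption: a higher image-text similarity is read as higher
    predictability of that word given the visual context.
    """
    return {
        word: cosine_similarity(image_vec, vec)
        for word, vec in text_vecs.items()
    }


# Made-up toy feature vectors; real CLIP embeddings have hundreds of dimensions.
image_vec = [1.0, 0.0, 0.0]
text_vecs = {
    "throw": [2.0, 0.0, 0.0],  # same direction as the image vector
    "sleep": [0.0, 3.0, 0.0],  # orthogonal to the image vector
}

scores = predictability_scores(image_vec, text_vecs)
```

Under this assumption, a candidate word whose text vector points in the same direction as the image vector gets the highest score, regardless of vector length, because cosine similarity normalizes by magnitude.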