Speech language models lack important brain-relevant semantics

Abstract

Despite known differences between reading and listening in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree. This raises the question of what types of information language models truly predict in the brain. We investigate this question via a direct approach, in which we eliminate information related to specific low-level stimulus features (textual, speech, and visual) in the language model representations, and observe how this intervention affects the alignment with fMRI brain recordings acquired while participants read versus listened to the same naturalistic stories. We further contrast our findings with speech-based language models, which would be expected to predict speech-evoked brain activity better, provided they model language processing in the brain well. Using our direct approach, we find that both text-based and speech-based language models align well with early sensory regions due to shared low-level features. Text-based models continue to align well with later language regions even after removing these features, while, surprisingly, speech-based models lose most of their alignment. These findings suggest that speech-based models can be further improved to better reflect brain-like language processing.
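
The feature-removal idea described above can be illustrated with a minimal sketch. The abstract does not specify the implementation, so everything here is an assumption made for illustration: that removal is done by linearly regressing the low-level features out of the model representations (residualization), that alignment is the mean per-voxel Pearson correlation of a ridge encoding model on held-out data, and that the data are synthetic. The names `ridge_fit`, `remove_features`, `brain_alignment`, and `demo` are hypothetical.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge weights W so that X @ W approximates Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def remove_features(reps, low_level, alpha=1.0):
    """Assumed removal step: subtract the part of `reps` that is
    linearly predictable from `low_level` (residualization)."""
    W = ridge_fit(low_level, reps, alpha)
    return reps - low_level @ W

def brain_alignment(reps_tr, fmri_tr, reps_te, fmri_te, alpha=1.0):
    """Fit a ridge encoding model on the training split and return the
    mean per-voxel Pearson correlation on the test split."""
    W = ridge_fit(reps_tr, fmri_tr, alpha)
    pred = reps_te @ W
    pred_c = pred - pred.mean(axis=0)
    true_c = fmri_te - fmri_te.mean(axis=0)
    denom = np.linalg.norm(pred_c, axis=0) * np.linalg.norm(true_c, axis=0)
    return float(((pred_c * true_c).sum(axis=0) / (denom + 1e-12)).mean())

def demo(seed=0, n=400, n_train=300):
    """Synthetic example: representations mix low-level and semantic
    signals; an 'early' region tracks the low-level signal and a
    'late' region tracks the semantic signal."""
    rng = np.random.default_rng(seed)
    low = rng.standard_normal((n, 4))        # stand-in low-level features
    sem = rng.standard_normal((n, 4))        # stand-in semantic signal
    reps = np.hstack([low, sem]) @ rng.standard_normal((8, 8))
    early = low @ rng.standard_normal((4, 10)) + 0.1 * rng.standard_normal((n, 10))
    late = sem @ rng.standard_normal((4, 10)) + 0.1 * rng.standard_normal((n, 10))

    resid = remove_features(reps, low)       # representations after removal
    tr, te = slice(0, n_train), slice(n_train, n)
    scores = {}
    for name, X in [("before", reps), ("after", resid)]:
        scores[f"early_{name}"] = brain_alignment(X[tr], early[tr], X[te], early[te])
        scores[f"late_{name}"] = brain_alignment(X[tr], late[tr], X[te], late[te])
    return scores

results = demo()
```

In this toy setup, comparing the `before` and `after` scores for each simulated region mirrors the comparison the abstract describes: a region whose alignment collapses after removal was being predicted mainly through the removed low-level features. Real analyses would additionally need cross-validation, hemodynamic delay modeling, and noise-ceiling normalization, none of which are shown here.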
