Explanations of Large Language Models Explain Language Representations in the Brain

Abstract

Large language models (LLMs) not only exhibit human-like performance but also share computational principles with the brain's language processing mechanisms. While prior research has focused on mapping LLMs' internal representations to neural activity, we propose a novel approach using explainable AI (XAI) to strengthen this link. Applying attribution methods, we quantify the influence of preceding words on LLMs' next-word predictions and use these explanations to predict fMRI data from participants listening to narratives. We find that attribution methods robustly predict brain activity across the language network, revealing a hierarchical pattern: explanations from early layers align with the brain's initial language processing stages, while later layers correspond to more advanced stages. Additionally, layers with greater influence on next-word prediction, as reflected in higher attribution scores, demonstrate stronger brain alignment. These results underscore XAI's potential for exploring the neural basis of language and suggest that brain alignment can be used to assess the biological plausibility of explanation methods.
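
The step of using explanations to predict fMRI data can be illustrated with a minimal sketch. The abstract does not state which regression model is used, so the ridge-regularized linear encoding model below is an assumption, as are the synthetic data, the array shapes, and the helper names `fit_ridge` and `pearson_per_column`. Each row of `attributions` stands in for the attribution scores of the preceding words at one time point, and each column of `bold` stands in for one voxel's response.

```python
import numpy as np


def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def pearson_per_column(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return (A * B).sum(axis=0) / denom


# Synthetic stand-ins (assumptions, not data from the paper).
rng = np.random.default_rng(0)
n_timepoints, n_context, n_voxels = 400, 10, 5

# Hypothetical attribution features: one row per time point,
# one column per preceding-word position.
attributions = rng.normal(size=(n_timepoints, n_context))

# Simulated voxel responses: a linear function of the features plus noise.
true_weights = rng.normal(size=(n_context, n_voxels))
bold = attributions @ true_weights + 0.5 * rng.normal(size=(n_timepoints, n_voxels))

# Fit on the first 300 time points, evaluate on the held-out 100.
train, test = slice(0, 300), slice(300, 400)
W = fit_ridge(attributions[train], bold[train], alpha=1.0)
scores = pearson_per_column(attributions[test] @ W, bold[test])

print("held-out correlation per voxel:", np.round(scores, 3))
```

In this sketch, brain alignment is the held-out correlation between predicted and measured responses, computed per voxel. Repeating the fit with attributions from different layers would give one alignment score per layer, which is the kind of comparison the abstract describes.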
