Abstract
Large language models (LLMs) not only exhibit human-like performance but also share computational principles with the brain's language processing mechanisms. While prior research has focused on mapping LLMs' internal representations to neural activity, we propose a novel approach using explainable AI (XAI) to strengthen this link. Applying attribution methods, we quantify the influence of preceding words on LLMs' next-word predictions and use these explanations to predict fMRI data from participants listening to narratives. We find that attribution methods robustly predict brain activity across the language network, revealing a hierarchical pattern: explanations from early layers align with the brain's initial language processing stages, while later layers correspond to more advanced stages. Additionally, layers with greater influence on next-word prediction, reflected in higher attribution scores, demonstrate stronger brain alignment. These results underscore XAI's potential for exploring the neural basis of language and suggest brain alignment as a means of assessing the biological plausibility of explanation methods.