Abstract
The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states, current stimuli, and episode-level summaries via dual cross-attention mechanisms that attend to both perceptual information extracted from the stimulus and narrative information provided by high-level summaries of the episode. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural responses. Another is the combination of a shared encoder with a partially subject-specific decoder, which leverages common structure across subjects while accounting for individual variability. Our model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating the effectiveness of temporally aware, multimodal sequence modeling for brain activity prediction. The code is available at https://github.com/Angelneer926/Algonauts_challenge.
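
To make the dual cross-attention idea concrete, the following is a minimal sketch of one decoder step in plain Python. It is an illustration under simplifying assumptions, not the implementation in the linked repository: it uses a single attention head with no learned projections, layer normalization, or feed-forward sublayer, and the names `attend` and `dual_cross_attention_step` are hypothetical. The intent is only to show the flow of information: causal self-attention over prior brain states, followed by one cross-attention over stimulus features and a second over episode-summary features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors.

    With causal=True, query i may only attend to keys 0..i, which mirrors
    autoregressive prediction from prior brain states.
    """
    dim = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys[:limit]
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values[:limit]))
            for j in range(len(values[0]))
        ])
    return outputs


def add(xs, ys):
    """Element-wise residual connection between two sequences of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def dual_cross_attention_step(brain_states, stimulus_feats, summary_feats):
    """Hypothetical decoder step: self-attention, then two cross-attentions."""
    # 1) Causal self-attention over the sequence of prior brain states.
    h = add(brain_states,
            attend(brain_states, brain_states, brain_states, causal=True))
    # 2) Cross-attention to perceptual (stimulus) features.
    h = add(h, attend(h, stimulus_feats, stimulus_feats))
    # 3) Cross-attention to narrative (episode-summary) features.
    h = add(h, attend(h, summary_feats, summary_feats))
    return h


# Toy inputs: 3 time steps of 2-d "brain" vectors, 2 stimulus tokens, 1 summary token.
brain = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
stimulus = [[1.0, 0.0], [0.0, 1.0]]
summary = [[0.5, 0.5]]
hidden = dual_cross_attention_step(brain, stimulus, summary)
```

In this sketch all three inputs share one feature dimension so that the residual additions are well defined; a practical model would instead project each modality into a common space with learned weights.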