Abstract
The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states and current stimuli via dual cross-attention mechanisms that attend both to perceptual information extracted from the stimulus and to narrative information provided by high-level summaries of the content. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural responses. Another is the combination of a shared encoder with a partially subject-specific decoder, which leverages common representational structure across subjects while accounting for individual variability. Our model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating the effectiveness of temporally aware, multimodal sequence modeling for brain activity prediction. The code is available at https://github.com/Angelneer926/Algonauts_challenge.
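To make the decoder description concrete, the following is a minimal NumPy sketch of one decoding pass, not the submitted implementation. It assumes single-head attention, a single layer, random untrained weights, and toy dimensions. The class name `ToyBrainDecoder`, its parameters, and the use of a per-subject linear readout as the "partially subject-specific" component are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    """Single-head scaled dot-product attention; q: (Tq, d), k and v: (Tk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Position t may only attend to positions <= t.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

class ToyBrainDecoder:
    """Hypothetical one-layer decoder: shared body, subject-specific readout."""

    def __init__(self, d_model, n_parcels, n_subjects, seed=0):
        rng = np.random.default_rng(seed)
        # Shared projection from fMRI (parcel) space into model space.
        self.w_in = rng.normal(0.0, 0.1, (n_parcels, d_model))
        # One readout matrix per subject (assumed form of subject specificity).
        self.w_out = rng.normal(0.0, 0.1, (n_subjects, d_model, n_parcels))

    def forward(self, prev_fmri, perceptual, narrative, subject):
        # prev_fmri: (T, n_parcels), brain states shifted by one step.
        # perceptual, narrative: (T_ctx, d_model), already-encoded features.
        h = prev_fmri @ self.w_in
        h = h + attention(h, h, h, causal=True)       # prior brain states
        h = h + attention(h, perceptual, perceptual)  # cross-attention 1
        h = h + attention(h, narrative, narrative)    # cross-attention 2
        return h @ self.w_out[subject]                # (T, n_parcels)

# Toy usage with random inputs (all sizes are illustrative assumptions).
rng = np.random.default_rng(1)
T, d_model, n_parcels = 5, 8, 10
model = ToyBrainDecoder(d_model, n_parcels, n_subjects=4)
prev_fmri = rng.normal(size=(T, n_parcels))
perceptual = rng.normal(size=(12, d_model))
narrative = rng.normal(size=(3, d_model))
pred = model.forward(prev_fmri, perceptual, narrative, subject=0)
```

In this sketch the causal mask applies only to the self-attention over prior brain states, while both cross-attention steps see the full stimulus context. Whether the actual model masks the stimulus context is not stated in the abstract.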