Abstract
Decoding non-invasive brain recordings is crucial for advancing our understanding of human cognition, yet it faces challenges from individual differences and complex neural signal representations. Traditional methods require custom models and extensive trials, and lack interpretability in visual reconstruction tasks. Our framework integrates 3D brain structures with visual semantics using a Vision Transformer 3D. The unified feature extractor efficiently aligns fMRI features with multiple levels of visual embeddings, removing the need for individual-specific models and allowing extraction from single-trial data. This extractor consolidates multi-level visual features into one network, simplifying integration with Large Language Models (LLMs). Additionally, we have enhanced the fMRI dataset with various fMRI-image-related textual data to support multimodal large model development. The integration with LLMs enhances decoding capabilities, enabling tasks like brain captioning, question-answering, detailed descriptions, complex reasoning, and visual reconstruction. Our approach not only shows superior performance across these tasks but also precisely identifies and manipulates language-based concepts within brain signals, enhancing interpretability and providing deeper insight into neural processes. These advances significantly broaden the applicability of non-invasive brain decoding in neuroscience and human-computer interaction, setting the stage for advanced brain-computer interfaces and cognitive models.