Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

  • 2025-06-13 03:32:48
  • Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai

Abstract

Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that the current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, respectively, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained from SFE will facilitate further developments in AI-enhanced scientific discoveries.

 
