Abstract
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that the current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, respectively, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained from SFE will facilitate further developments in AI-enhanced scientific discoveries.