Abstract
The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark that thoroughly assesses LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To reveal whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions that contain different levels of the knowledge required to solve it, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, which move more of the question information from the text into the diagrams. By comparing results across the different versions, SciVerse systematically examines the domain knowledge and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy that conducts a step-wise assessment of knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights for future development. Project page: https://sciverse-cuhk.github.io