SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation

Abstract

Large multimodal models (LMMs) have proven flexible and generalisable across many tasks and fields. Although they have strong potential to aid scientific research, their capabilities in this domain are not well characterised. A key aspect of scientific research is the ability to understand and interpret figures, which serve as a rich, compressed source of complex information. In this work, we present SciFIBench, a scientific figure interpretation benchmark. Our main benchmark consists of a 1000-question gold set of multiple-choice questions split between two tasks across 12 categories. The questions are curated from CS arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control. We evaluate 26 LMMs on SciFIBench, finding it to be a challenging benchmark. Finally, we investigate the alignment and reasoning faithfulness of the LMMs on augmented question sets from our benchmark. We release SciFIBench to encourage progress in this domain.
