CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Abstract

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap, especially for this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
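
To illustrate how a tree-based edit distance can give non-binary partial credit, the following is a minimal sketch and not the SEED implementation from the CMPhysBench repository. It assumes expressions are already parsed into nested `(label, children)` tuples and uses a simple top-down tree edit distance (relabel a node, or insert or delete a whole subtree). The names `size`, `tree_distance`, and `similarity_score`, and the normalization to a 0-100 scale, are hypothetical choices made for this example.

```python
from typing import List, Tuple

# Hypothetical expression-tree node: (label, list of child nodes).
Tree = Tuple[str, list]


def size(t: Tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in t[1])


def tree_distance(a: Tree, b: Tree) -> int:
    """Top-down tree edit distance.

    Allowed operations: relabel a node (cost 1), delete or insert a
    whole subtree (cost = number of nodes in that subtree).
    """
    relabel = 0 if a[0] == b[0] else 1
    ca: List[Tree] = a[1]
    cb: List[Tree] = b[1]
    m, n = len(ca), len(cb)

    # d[i][j] = cost of aligning the first i children of a
    # with the first j children of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),      # delete subtree from a
                d[i][j - 1] + size(cb[j - 1]),      # insert subtree from b
                d[i - 1][j - 1] + tree_distance(ca[i - 1], cb[j - 1]),
            )
    return relabel + d[m][n]


def similarity_score(pred: Tree, truth: Tree) -> float:
    """Map the distance to a 0-100 score (assumed normalization)."""
    dist = tree_distance(pred, truth)
    return 100.0 * max(0.0, 1.0 - dist / max(size(pred), size(truth)))


# Example: x + 1 versus x + 2 differ by a single leaf.
truth = ("+", [("x", []), ("1", [])])
pred = ("+", [("x", []), ("2", [])])
print(similarity_score(pred, truth))
```

Under these assumptions, a prediction that differs from the reference in a single leaf still earns most of the credit, whereas an exact-match metric would score it as zero.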
