CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

  • 2025-08-25 15:32:22
  • Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Lei Bai, Yunqi Cai, Xi Dai, Shufei Zhang, Jinguang Cheng, Zhong Fang, Hongming Weng

Abstract

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Leveraging tree-based representations of expressions, we also introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches an average SEED score of only 36 and an accuracy of only 28% on CMPhysBench, underscoring a significant capability gap, especially for this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
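
To make the idea of tree-based partial credit concrete, the following is a minimal sketch and not the SEED implementation from the paper. It assumes answers are written in Python arithmetic syntax rather than LaTeX, parses them with the standard `ast` module, and scores them with a simplified top-down tree edit distance in which a node can be relabeled and whole subtrees can be inserted or deleted. The names `to_tree`, `size`, `distance`, and `similarity`, the unit costs, and the normalization are all hypothetical choices made for illustration.

```python
import ast
from functools import lru_cache


def to_tree(expr: str):
    """Parse a Python-syntax expression into a nested (label, children) tuple."""
    def build(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__, (build(node.left), build(node.right)))
        if isinstance(node, ast.UnaryOp):
            return (type(node.op).__name__, (build(node.operand),))
        if isinstance(node, ast.Call):
            return (ast.unparse(node.func), tuple(build(a) for a in node.args))
        if isinstance(node, ast.Name):
            return (node.id, ())
        if isinstance(node, ast.Constant):
            return (repr(node.value), ())
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")
    return build(ast.parse(expr, mode="eval").body)


def size(tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def distance(a, b) -> int:
    """Simplified top-down edit distance (an assumption, not the paper's metric).

    Relabeling a node costs 1; inserting or deleting a whole subtree costs
    its node count. Children are aligned in order, like a string edit distance.
    """
    relabel = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),                     # delete subtree
                dp[i][j - 1] + size(ys[j - 1]),                     # insert subtree
                dp[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),  # match subtrees
            )
    return relabel + dp[-1][-1]


def similarity(prediction: str, truth: str) -> float:
    """Map the distance to a 0-100 score; the normalization is illustrative only."""
    a, b = to_tree(prediction), to_tree(truth)
    return 100.0 * max(0.0, 1.0 - distance(a, b) / max(size(a), size(b)))
```

Under these assumptions, `similarity("a*b + c", "a*b - c")` gives a high but not perfect score, because only the root operator differs, whereas exact-match grading would assign zero. This sketch compares syntax only, so mathematically equivalent answers written in different forms are still penalized.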

 
