StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

  • 2025-10-10 16:28:43
  • Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai, Ziwei Wang, Jiayi Xiang, Wenxin E, Siran Gao, Xinyao Ruan, Yirui Huang, Chenjing Xi, Haibo Hu, Yueming Fu, Qinglan Yu, Xiaobing Wei, Jiani Gu, Rui Sun, Jiaxuan Jia, Fan Zhou

Abstract

Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that closed-source models such as GPT5-mini achieve below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.

 
