Abstract
Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce \textbf{StatEval}, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that closed-source models such as GPT5-mini achieve below 57\% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.