Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems

Abstract

Evaluating the mathematical capability of Large Language Models (LLMs) is a critical yet challenging frontier. Existing benchmarks fall short, particularly for proof-centric problems, as manual creation is unscalable and costly, leaving the true mathematical abilities of LLMs largely unassessed. To overcome these barriers, we propose Proof2Hybrid, the first fully automated framework that synthesizes high-quality, proof-centric benchmarks from natural language mathematical corpora. The key novelty of our solution is Proof2X, a roadmap for converting mathematical proofs into various kinds of questions that are easy to verify. Guided by this roadmap, we propose a new type of hybrid-formatted question, named ``$m$-out-of-$n$ multiple judge questions'', specifically designed to enable robust, automatic evaluation while being resilient to the guessing and superficial pattern matching inherent in traditional formats. As a demonstration of our framework, we introduce AlgGeoTest, a benchmark for algebraic geometry--a frontier domain of modern mathematics--comprising 456 challenging items. Our extensive evaluations of state-of-the-art LLMs using AlgGeoTest reveal profound deficits in their comprehension of algebraic geometry, providing a more precise measure of their true mathematical capabilities. Our framework and benchmark pave the way for a new wave of in-depth research into the mathematical intelligence of AI systems.
