Theorem Prover as a Judge for Synthetic Data Generation

Abstract

The demand for synthetic data in mathematical reasoning has increased due toits potential to enhance the mathematical capabilities of large language models(LLMs). However, ensuring the validity of intermediate reasoning steps remainsa significant challenge, affecting data quality. While formal verification viatheorem provers effectively validates LLM reasoning, the autoformalisation ofmathematical proofs remains error-prone. In response, we introduce iterativeautoformalisation, an approach that iteratively refines theorem proverformalisation to mitigate errors, thereby increasing the execution rate on theLean prover from 60% to 87%. Building upon that, we introduce Theorem Prover asa Judge (TP-as-a-Judge), a method that employs theorem prover formalisation torigorously assess LLM intermediate reasoning, effectively integratingautoformalisation with synthetic data generation. Finally, we presentReinforcement Learning from Theorem Prover Feedback (RLTPF), a framework thatreplaces human annotation with theorem prover feedback in ReinforcementLearning from Human Feedback (RLHF). Across multiple LLMs, applyingTP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B forSVAMP, and 3.55% on Llama-3.1-8B for AQUA.

Quick Read (beta)

loading the full paper ...