Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

  • 2025-06-18 17:07:47
  • Huaiyuan Ying, Zijian Wu, Yihan Geng, Zheng Yuan, Dahua Lin, Kai Chen
  • 0

Abstract

Large language models have demonstrated impressive capabilities acrossvarious natural language processing tasks, especially in solving mathematicalproblems. However, large language models are not good at math theorem provingusing formal languages like Lean. A significant challenge in this area is thescarcity of training data available in these formal languages. To address thisissue, we propose a novel pipeline that iteratively generates and filterssynthetic data to translate natural language mathematical problems into Lean 4statements, and vice versa. Our results indicate that the synthetic datapipeline can provide useful training data and improve the performance of LLMsin translating and understanding complex mathematical problems and proofs. Ourfinal dataset contains about 57K formal-informal question pairs along withsearched proof from the math contest forum and 21 new IMO questions. Weopen-source our code at https://github.com/InternLM/InternLM-Math and our dataat https://huggingface.co/datasets/InternLM/Lean-Workbook.

 

Quick Read (beta)

loading the full paper ...