Abstract
Large language models (LLMs) like GPT-4, PaLM, and LLaMA have shown significant improvements in various reasoning tasks. However, smaller models such as Llama-3-8B and DeepSeekMath-Base still struggle with complex mathematical reasoning because they fail to effectively identify and correct reasoning errors. Recent reflection-based methods aim to address these issues by enabling self-reflection and self-correction, but they still face challenges in independently detecting errors in their reasoning steps. To overcome these limitations, we propose SuperCorrect, a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model by following the teacher's correction traces during training. This cross-model DPO approach teaches the student model to effectively locate and resolve erroneous thoughts with error-driven insights from the teacher model, breaking the bottleneck of its thoughts and acquiring new skills and knowledge to tackle challenging problems. Extensive experiments consistently demonstrate our superiority over previous methods. Notably, our SuperCorrect-7B model significantly surpasses powerful DeepSeekMath-7B by 7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3% on MATH/GSM8K benchmarks, achieving new SOTA performance among all 7B models. Code: https://github.com/YangLing0818/SuperCorrect-llm