Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

Abstract

Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and its inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT), a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM that we aim to optimize on positive data, enabling direct policy optimization on all of the LLM's generations. We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that, by additionally leveraging negative feedback, NFT significantly improves over SL baselines like Rejection sampling Fine-Tuning, matching or even surpassing leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict-on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
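
The abstract does not spell out how the implicit negative policy is built, so the following is only a toy sketch of one way it could work, not the paper's implementation. It assumes the generating policy is a mixture of a positive and a negative policy weighted by the question's accuracy rate, so that the negative likelihood can be written in terms of the positive model alone. The function names, the scalar probabilities, and the clamping constant `eps` are all hypothetical.

```python
import math


def implicit_negative_prob(p_old, p_pos, acc, eps=1e-6):
    """Assumed mixture: p_old = acc * p_pos + (1 - acc) * p_neg.

    Solving for p_neg expresses the negative policy through the
    positive model's probability p_pos (illustrative assumption).
    """
    if not 0.0 <= acc < 1.0:
        raise ValueError("acc must be in [0, 1)")
    p_neg = (p_old - acc * p_pos) / (1.0 - acc)
    # Clamp so the logarithm below stays defined (hypothetical safeguard).
    return max(p_neg, eps)


def toy_loss(p_old, p_pos, acc, is_correct):
    """Negative log-likelihood on positives and on implicit negatives."""
    if is_correct:
        return -math.log(p_pos)
    return -math.log(implicit_negative_prob(p_old, p_pos, acc))


if __name__ == "__main__":
    # For a wrong answer, lowering the positive model's probability
    # raises the implicit negative likelihood and lowers the loss.
    print(toy_loss(p_old=0.4, p_pos=0.6, acc=0.5, is_correct=False))
    print(toy_loss(p_old=0.4, p_pos=0.2, acc=0.5, is_correct=False))
```

Under this assumption, the same parameters receive a learning signal from both correct and incorrect answers: correct answers push `p_pos` up, while incorrect answers push it down through the implicit negative likelihood.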
