Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

Abstract

We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experimental results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The main mechanism of OSFT lies in facilitating the model's own existing preference (latent knowledge) learned from pretraining, which leads to improved reasoning ability. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
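
The generate-then-finetune loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the categorical "policy" over four candidate responses, the function names `softmax` and `osft_toy`, the learning rate, the step count, and the choice of a sampling temperature below the training temperature are all assumptions made for illustration. The last assumption matters in this toy because, if samples were drawn from exactly the distribution being trained, the expected log-likelihood gradient would be zero. Each step draws one rollout and takes one supervised step on it, with no reward signal.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def osft_toy(logits, steps=300, lr=0.1, sample_temp=0.5, seed=0):
    """Toy self-tuning loop (hypothetical, for illustration only).

    Each step: sample ONE response from the current policy, then take one
    gradient-ascent step on the log-likelihood of that response. No reward
    or verifier is consulted anywhere.
    """
    rng = random.Random(seed)
    logits = list(logits)  # copy so the caller's list is not modified
    indices = list(range(len(logits)))
    for _ in range(steps):
        # Rollout: assumed sampling temperature below the training temperature.
        sample_probs = softmax(logits, sample_temp)
        y = rng.choices(indices, weights=sample_probs, k=1)[0]
        # SFT step at temperature 1: d log p(y) / d logit_i = 1[i == y] - p_i.
        train_probs = softmax(logits)
        for i in indices:
            target = 1.0 if i == y else 0.0
            logits[i] += lr * (target - train_probs[i])
    return logits


if __name__ == "__main__":
    start = [2.0, 1.0, 0.0, 0.0]  # the policy already prefers response 0
    print("before:", softmax(start))
    print("after: ", softmax(osft_toy(start)))
```

In this toy setting, the probability of the response the policy already preferred grows over training, which is one simple way to picture self-tuning amplifying an existing preference. A real run would replace the categorical policy with an LLM and the logit update with a standard SFT loss on the generated tokens.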
