On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Abstract

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
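
The rescaling described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names are hypothetical, and the sketch assumes that the token probability used as the scaling factor is treated as a constant, so that no gradient flows through it. Under that assumption, the DFT gradient with respect to the logits is the standard SFT gradient multiplied by the probability of the target token.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_token_loss(logits, target):
    """Standard SFT loss for one token: negative log-likelihood of the target."""
    return -math.log(softmax(logits)[target])


def dft_token_loss(logits, target):
    """Hypothetical DFT-style loss: the SFT loss rescaled by the target probability."""
    p = softmax(logits)[target]
    return -p * math.log(p)


def sft_logit_grad(logits, target):
    """Gradient of the SFT loss w.r.t. the logits: softmax(logits) - one_hot(target)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def dft_logit_grad(logits, target):
    """Gradient of the rescaled loss, ASSUMING the scaling factor is held constant.

    With the weight treated as a constant, the gradient is simply the SFT
    gradient multiplied by the probability of the target token.
    """
    weight = softmax(logits)[target]
    return [weight * g for g in sft_logit_grad(logits, target)]


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]  # illustrative values only
    target = 2                 # a low-probability target token
    print("SFT grad:", sft_logit_grad(logits, target))
    print("DFT grad:", dft_logit_grad(logits, target))
```

In an autograd framework, the same idea would typically amount to multiplying the per-token loss by a copy of the token probability that is excluded from differentiation, which is consistent with the abstract's description of a single-line change.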
