Abstract
Supervised fine-tuning (SFT) of large language models can be viewed as an off-policy learning problem, where expert demonstrations come from a fixed behavior policy while training aims to optimize a target policy. Importance sampling is the standard tool for correcting this distribution mismatch, but large policy gaps lead to skewed weights, high variance, and unstable optimization. Existing methods mitigate this issue with KL penalties or clipping, which passively restrict updates rather than actively reducing the gap. We propose a simple yet effective data rewriting framework that proactively shrinks the policy gap before training. For each problem, correct model-generated solutions are kept as on-policy data, while incorrect ones are rewritten through guided re-solving, falling back to expert demonstrations only when needed. This aligns the training distribution with the target policy, reducing variance and improving stability. To handle residual mismatch after rewriting, we additionally apply importance sampling during training, forming a two-stage approach that combines data-level alignment with lightweight optimization-level correction. Experiments on five mathematical reasoning benchmarks show consistent and significant gains over both vanilla SFT and the state-of-the-art Dynamic Fine-Tuning (DFT) approach. Data and code will be released at https://github.com/NKU-HLT/Off-Policy-SFT.
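The per-problem selection rule described above can be illustrated with a minimal sketch. It is not the released implementation: the function name `build_training_example` and the callables `sample`, `is_correct`, and `re_solve` are hypothetical placeholders, and two simplifications are assumed here, namely that only the first correct self-generated solution is kept and that guided re-solving receives the expert demonstration as its guide.

```python
from typing import Callable, Dict, List, Optional


def build_training_example(
    problem: str,
    expert_solution: str,
    sample: Callable[[str], List[str]],               # hypothetical: target-policy samples
    is_correct: Callable[[str, str], bool],           # hypothetical: answer checker
    re_solve: Callable[[str, str], Optional[str]],    # hypothetical: guided re-solving
) -> Dict[str, str]:
    """Pick one training solution for a problem, preferring on-policy data."""
    # Stage 1: keep a correct model-generated solution as on-policy data.
    # (Assumption: the first correct sample is kept; the paper may keep more.)
    for candidate in sample(problem):
        if is_correct(problem, candidate):
            return {"problem": problem, "solution": candidate, "source": "on_policy"}

    # Stage 2: no sample was correct, so try guided re-solving.
    # (Assumption: the expert demonstration serves as the guide.)
    rewritten = re_solve(problem, expert_solution)
    if rewritten is not None and is_correct(problem, rewritten):
        return {"problem": problem, "solution": rewritten, "source": "rewritten"}

    # Stage 3: fall back to the expert demonstration only when needed.
    return {"problem": problem, "solution": expert_solution, "source": "expert"}
```

Under these assumptions, the `source` field records how far each example sits from the target policy, which is the residual mismatch that importance sampling is then meant to correct during training.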