Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs

Abstract

Large language models (LLMs) excel at mathematical reasoning and logical problem-solving. Current training paradigms primarily use supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance the models' reasoning abilities. Used alone, however, each has its own weakness: SFT may overfit, while RL is prone to mode collapse. State-of-the-art methods have proposed hybrid training schemes, but static switching generalizes poorly across tasks and depends heavily on data quality. In response to these challenges, and inspired by the curriculum learning-quiz mechanism in human reasoning cultivation, we propose SASR, a step-wise adaptive hybrid training framework that theoretically unifies SFT and RL and dynamically balances the two throughout optimization. SASR uses SFT for an initial warm-up to establish basic reasoning skills, and then uses an adaptive dynamic adjustment algorithm based on gradient norm and divergence relative to the original distribution to seamlessly integrate SFT with the online RL method GRPO. By monitoring the training status of LLMs and adjusting the training process in sequence, SASR ensures a smooth transition between training schemes, maintaining core reasoning abilities while exploring different paths. Experimental results demonstrate that SASR outperforms SFT, RL, and static hybrid training methods.
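
The abstract does not give the adjustment rule itself, so the following is only a minimal illustrative sketch of what a step-wise choice between an SFT update and a GRPO update could look like. The function names (`sft_probability`, `choose_step`), the reference value `warmup_grad_norm`, and the ratio formula are all hypothetical assumptions made for illustration; they are not the SASR formulation.

```python
import random


def sft_probability(grad_norm: float, warmup_grad_norm: float, kl: float) -> float:
    """Hypothetical rule (an assumption, not the paper's formula).

    Returns the probability of taking an SFT step instead of a GRPO step.
    The probability grows with the current gradient norm and with the
    divergence `kl` from the original distribution, so a model that is
    still far from converged, or drifting away, is pulled back toward
    supervised data.
    """
    if grad_norm < 0 or kl < 0 or warmup_grad_norm <= 0:
        raise ValueError("expected grad_norm >= 0, kl >= 0, warmup_grad_norm > 0")
    signal = grad_norm + kl
    # Ratio in [0, 1): 0 when both signals vanish, approaching 1 as they grow.
    return signal / (signal + warmup_grad_norm)


def choose_step(p_sft: float, rng: random.Random) -> str:
    """Sample which objective to optimize at this training step."""
    return "sft" if rng.random() < p_sft else "grpo"


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up monitoring values, used only to exercise the functions.
    p = sft_probability(grad_norm=0.8, warmup_grad_norm=1.0, kl=0.2)
    print(p, choose_step(p, rng))
```

Under this assumed rule the decision is re-evaluated at every step, in contrast to a static schedule that switches from SFT to RL once at a fixed point.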
