Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

Abstract

Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach limits interaction between SFT and RL, thereby constraining overall effectiveness. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain: the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.
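
The bilevel structure described above can be sketched as follows. The notation is assumed here for illustration and is not taken from the paper: $\theta$ denotes the parameters updated by RL, $\omega$ the parameters through which SFT guides RL, $J_{\mathrm{RL}}$ the expected reward, $\mathcal{L}_{\mathrm{SFT}}$ the supervised loss, and $\lambda$ a hypothetical mixing weight. The exact parameterization and objectives are given in the full paper and may differ.

```latex
% Illustrative sketch only; symbols are assumptions, not the paper's notation.
\[
\begin{aligned}
\text{lower level (RL with SFT supervision):}\quad
  & \theta^{*}(\omega) = \arg\max_{\theta}\; J_{\mathrm{RL}}(\theta) - \lambda\, \mathcal{L}_{\mathrm{SFT}}(\theta, \omega), \\
\text{RL-only reference:}\quad
  & \theta^{*}_{\mathrm{RL}} = \arg\max_{\theta}\; J_{\mathrm{RL}}(\theta), \\
\text{upper level (cooperative gain):}\quad
  & \max_{\omega}\; J_{\mathrm{RL}}\bigl(\theta^{*}(\omega)\bigr) - J_{\mathrm{RL}}\bigl(\theta^{*}_{\mathrm{RL}}\bigr).
\end{aligned}
\]
```

Under this reading, the upper level depends on $\omega$ only through the lower-level solution $\theta^{*}(\omega)$, which is what it means for the SFT side to be conditioned on the optimal RL policy.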
