SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Abstract

Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
