Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)

Abstract

Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting, which lends support to its often observed good performance. From this viewpoint, we realize that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL, as it: i) optimizes a tighter bound on the RL objective and ii) can improve performance compared to SFT on curated data. We refer to this variant as importance weighted supervised fine-tuning (iw-SFT). We show that it is easy to implement and can be further generalized to training with quality scored data. The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks, for example achieving 66.7% on the AIME 2024 dataset.
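
To make the "small modification to SFT" concrete, the following is a minimal sketch of how an importance weighted loss could differ from plain SFT when both are computed from per-sequence log-probabilities. The abstract does not give the exact form of the weights, so everything specific here is an assumption for illustration: the function names, the choice to weight each curated sequence by the ratio between a hypothetical auxiliary distribution and the reference (data-generating) policy, and the clipping range.

```python
import math

def sft_loss(logp_policy):
    """Plain SFT: mean negative log-likelihood of the curated sequences."""
    return -sum(logp_policy) / len(logp_policy)

def iw_sft_loss(logp_policy, logp_aux, logp_ref, clip_min=0.2, clip_max=5.0):
    """Importance weighted SFT sketch (assumed form, not taken from the abstract).

    Each sequence's log-likelihood is scaled by a weight
    exp(logp_aux - logp_ref), clipped to [clip_min, clip_max].
    The weights are treated as constants, not differentiated through.
    """
    weights = [
        min(max(math.exp(a - r), clip_min), clip_max)
        for a, r in zip(logp_aux, logp_ref)
    ]
    weighted = [w * lp for w, lp in zip(weights, logp_policy)]
    return -sum(weighted) / len(logp_policy)

# Hypothetical per-sequence log-probabilities for three curated samples.
logp = [-1.0, -2.0, -3.0]
print(sft_loss(logp))
print(iw_sft_loss(logp, logp_aux=logp, logp_ref=logp))
```

When the auxiliary and reference log-probabilities coincide, every weight is 1 and the weighted loss reduces to plain SFT, which is one way to read iw-SFT as a strict generalization of SFT under these assumptions.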
