Reinforcement Learning Teachers of Test Time Scaling

Abstract

Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.
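
To make the reward idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes, purely for illustration, that "testing the student's understanding" can be approximated by the student's mean log-probability of the solution tokens once it has read the teacher's explanation. The names `explanation_reward`, `StudentLogProbs`, and `toy_student` are hypothetical, and the toy student is a stand-in for a real LM.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface (an assumption, not from the paper): given a question,
# an explanation, and the solution tokens, return the student's log-probability
# for each solution token.
StudentLogProbs = Callable[[str, str, Sequence[str]], Sequence[float]]


def explanation_reward(
    question: str,
    explanation: str,
    solution_tokens: Sequence[str],
    student_logprobs: StudentLogProbs,
) -> float:
    """Score an explanation by the student's mean log-probability of the solution."""
    if not solution_tokens:
        raise ValueError("solution_tokens must not be empty")
    logps = student_logprobs(question, explanation, solution_tokens)
    if len(logps) != len(solution_tokens):
        raise ValueError("expected one log-probability per solution token")
    # Every explanation receives a graded score, so the signal is dense
    # rather than a single correct/incorrect bit.
    return sum(logps) / len(logps)


def toy_student(
    question: str, explanation: str, solution_tokens: Sequence[str]
) -> Sequence[float]:
    """Toy stand-in for a student LM: a solution token is likely (p = 0.9)
    only if the explanation mentions it, otherwise unlikely (p = 0.1)."""
    mentioned = set(explanation.split())
    return [math.log(0.9 if tok in mentioned else 0.1) for tok in solution_tokens]


good = explanation_reward("What is 2 + 2?", "2 plus 2 equals 4", ["4"], toy_student)
bad = explanation_reward("What is 2 + 2?", "it is a number", ["4"], toy_student)
```

Under these assumptions, an explanation that leads the student to the known solution scores higher than one that does not, which is the kind of signal the abstract describes for training the teacher.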
