SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning

Abstract

Training large language models with reinforcement learning (RL) against verifiable rewards significantly enhances their reasoning abilities, yet remains computationally expensive due to inefficient uniform prompt sampling. We introduce Selective Prompting with Efficient Estimation of Difficulty (SPEED), an adaptive online RL curriculum that selectively chooses training examples of intermediate difficulty to maximize learning efficiency. Theoretically, we establish that intermediate-difficulty prompts improve the gradient estimator's signal-to-noise ratio, accelerating convergence. Empirically, our efficient implementation leads to 2x to 6x faster training without degrading accuracy, requires no manual tuning, and integrates seamlessly into standard RL algorithms.
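
The selection idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the probe size `n_probe`, and the `low`/`high` thresholds are hypothetical choices made for illustration, and the reward is assumed to be binary (1 for a verified-correct response, 0 otherwise). Under that assumption, a prompt with pass rate p has reward variance p(1 - p), which is zero when p is 0 or 1 and largest at p = 0.5, so prompts the model always solves or never solves contribute little signal.

```python
from typing import Callable, List, Sequence


def estimate_pass_rate(prompt: str, rollout: Callable[[str], bool], n_probe: int) -> float:
    """Estimate difficulty as the fraction of n_probe rollouts that are correct.

    `rollout` is a hypothetical callable: it samples one response for the
    prompt and returns True if the verifier accepts it.
    """
    successes = sum(1 for _ in range(n_probe) if rollout(prompt))
    return successes / n_probe


def reward_variance(p: float) -> float:
    """Variance of a binary (Bernoulli) reward with success probability p."""
    return p * (1.0 - p)


def select_intermediate(
    prompts: Sequence[str],
    rollout: Callable[[str], bool],
    n_probe: int = 4,    # assumed probe budget, not taken from the paper
    low: float = 0.0,    # assumed lower threshold on the pass rate
    high: float = 1.0,   # assumed upper threshold on the pass rate
) -> List[str]:
    """Keep prompts whose estimated pass rate lies strictly between low and high."""
    kept = []
    for prompt in prompts:
        p = estimate_pass_rate(prompt, rollout, n_probe)
        if low < p < high:
            kept.append(prompt)
    return kept
```

In a training loop, a filter like this would run on each candidate batch before the full set of rollouts is generated, so that the expensive sampling budget is spent only on prompts that pass the cheap probe.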
