RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

Abstract

We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.
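
To make the idea of an adaptive verifiable environment concrete, the following is a minimal sketch and not the RLVE-Gym implementation. The abstract does not specify the adaptation rule, so everything here is an assumption made for illustration: the toy `AdditionEnv`, the `AdaptiveDifficulty` controller, the integer difficulty levels, and the accuracy threshold are all hypothetical names and choices. The sketch shows the three ingredients the abstract names: procedural problem generation, an algorithmic verifier that returns a reward, and a difficulty range that widens once the policy solves the hardest current level reliably.

```python
import random


class AdditionEnv:
    """Hypothetical toy environment: add two integers with `difficulty` digits."""

    def generate(self, difficulty, rng):
        # Procedurally generate a problem whose size grows with difficulty.
        lo, hi = 10 ** (difficulty - 1), 10 ** difficulty - 1
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        return {"prompt": f"What is {a} + {b}?", "answer": a + b}

    def verify(self, problem, response):
        # Algorithmic check: reward 1.0 only for an exactly correct integer.
        try:
            return 1.0 if int(response.strip()) == problem["answer"] else 0.0
        except ValueError:
            return 0.0


class AdaptiveDifficulty:
    """Assumed adaptation rule: raise the hardest level once accuracy on it is high."""

    def __init__(self, threshold=0.9, min_samples=8):
        self.max_level = 1          # hardest difficulty currently sampled
        self.threshold = threshold  # accuracy required to unlock the next level
        self.min_samples = min_samples
        self.correct = 0
        self.total = 0

    def sample_level(self, rng):
        # Sample uniformly from every level unlocked so far.
        return rng.randint(1, self.max_level)

    def update(self, level, reward):
        # Only rollouts at the current hardest level count toward unlocking.
        if level != self.max_level:
            return
        self.total += 1
        self.correct += int(reward == 1.0)
        if self.total >= self.min_samples:
            if self.correct / self.total >= self.threshold:
                self.max_level += 1
            # Reset the window whether or not the level was raised.
            self.correct = 0
            self.total = 0
```

In a training loop under these assumptions, each step would call `sample_level`, `generate` a problem at that level, score the policy's response with `verify`, and pass the reward to `update`. A policy that keeps solving the hardest level sees harder problems, while a policy that fails stays at the current range, which is one simple way to avoid problems that are uniformly too easy or too hard.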
