Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning

Abstract

Reasoning-Oriented Reinforcement Learning (RORL) enhances the reasoning ability of Large Language Models (LLMs). However, because rewards in RORL are sparse, effective training depends heavily on selecting problems of appropriate difficulty. Curriculum learning attempts to address this by adjusting difficulty, but it often relies on static schedules, and even recent online filtering methods lack theoretical grounding and a systematic understanding of their effectiveness. In this work, we show theoretically and empirically that curating each batch, on the fly, with problems on which the training model achieves intermediate accuracy can maximize the effectiveness of RORL training; we call this balanced online difficulty filtering. We first derive that a lower bound on the KL divergence between the initial and the optimal policy can be expressed in terms of the variance of the sampled accuracy. Building on this insight, we show that balanced filtering can maximize the lower bound, leading to better performance. Experimental results across five challenging math reasoning benchmarks show that balanced online filtering yields an additional 10% on AIME and a 4% improvement on average over plain GRPO. Moreover, further analysis shows gains in sample efficiency and training time efficiency, exceeding the maximum reward of plain GRPO within 60% of the training time and of the training set volume.
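
The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the binary 0/1 reward, and the accuracy thresholds `low` and `high` are assumptions made for illustration, since the abstract does not specify them. The sketch keeps only problems whose sampled accuracy is intermediate, and includes the variance of a binary reward, p(1 - p), which peaks at p = 0.5 and is one way to read why intermediate accuracy is favored.

```python
from typing import Dict, List, Sequence


def sampled_accuracy(rewards: Sequence[int]) -> float:
    """Fraction of rollouts for one problem that earned a reward of 1."""
    return sum(rewards) / len(rewards)


def bernoulli_variance(p: float) -> float:
    """Variance of a binary reward with success probability p; largest at p = 0.5."""
    return p * (1.0 - p)


def balanced_filter(
    rollout_rewards: Dict[str, List[int]],
    low: float = 0.25,   # hypothetical lower accuracy threshold (assumption)
    high: float = 0.75,  # hypothetical upper accuracy threshold (assumption)
) -> List[str]:
    """Return ids of problems whose sampled accuracy lies in [low, high]."""
    kept = []
    for problem_id, rewards in rollout_rewards.items():
        if not rewards:
            continue  # no rollouts sampled for this problem, so nothing to estimate
        if low <= sampled_accuracy(rewards) <= high:
            kept.append(problem_id)
    return kept


# Toy rollouts: four sampled answers per problem, 1 = correct, 0 = incorrect.
rollouts = {
    "too_easy": [1, 1, 1, 1],
    "too_hard": [0, 0, 0, 0],
    "balanced": [1, 0, 1, 0],
    "borderline": [1, 0, 0, 0],
}
selected = balanced_filter(rollouts)
```

In this toy batch, the always-solved and never-solved problems are dropped, because every rollout receives the same reward and the sampled accuracy has zero variance. An actual training loop would presumably keep sampling until the filtered batch reaches the target size, but that detail is not stated in the abstract.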
