Abstract
Reinforcement learning (RL) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we present a novel gradient-alignment-based method, named LearnAlign, which intelligently selects learnable and representative reasoning data for RL post-training. To overcome the well-known issue of response-length bias in gradient norms, we introduce a data learnability measure based on the success rate, which indicates the learning potential of each data point. Experiments across three mathematical reasoning benchmarks demonstrate that our method significantly reduces training data requirements while incurring only minor performance degradation or even improving performance compared to full-data training. For example, it reduces data requirements by up to 1,000 data points with better performance (77.53%) than that on the full dataset on the GSM8K benchmark (77.04%). Furthermore, we show its effectiveness in the staged RL setting. This work provides valuable insights into data-efficient RL post-training and establishes a foundation for future research in optimizing reasoning data selection. To facilitate future work, we will release code.