Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
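
To illustrate why homogeneous rollouts weaken the learning signal, the following minimal sketch computes group-relative advantages in the style commonly used with GRPO: each reward is centered on the group mean and scaled by the group standard deviation. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for this illustration and are not taken from the paper's code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its rollout group (illustrative only).

    Assumption: advantage = (r - group mean) / (group std + eps), with the
    population standard deviation; actual implementations may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group whose trajectories all receive the same verifiable reward.
homogeneous = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A group with mixed outcomes, as more diverse rollouts might produce.
diverse = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

print(homogeneous)
print(diverse)
```

In the homogeneous group every advantage is zero, so the group contributes no gradient signal; in the mixed group the advantages are close to +1 and -1. This is the effect that trajectory-level diversity is meant to counteract.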
