Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

Abstract

Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, struggles to balance exploration and exploitation, causing policies to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. Although Pass@k has been used in evaluation in prior work, its connection to LLM exploration ability in RLVR remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., $\textbf{Pass@k Training}$), and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k Training with analytical derivation essentially involves directly designing the advantage function. Inspired by this, we preliminarily explore advantage design for RLVR, showing promising results and highlighting a potential future direction.
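
For readers unfamiliar with the metric, the following is a minimal sketch of the standard combinatorial Pass@k estimator commonly used in evaluation: given n sampled responses of which c are verified correct, it gives the probability that at least one of k responses drawn without replacement is correct. The function name `pass_at_k` and its arguments are illustrative assumptions; this is not the analytical advantage derived in the paper, only the evaluation-style quantity the abstract refers to.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled responses, c of which are correct.

    Hypothetical helper for illustration: computes 1 - C(n - c, k) / C(n, k),
    the probability that a size-k subset drawn without replacement from the
    n samples contains at least one correct response.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, i.e. every size-k subset
    # must contain a correct response, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With 1 correct response out of 4 samples, Pass@1 is low while
    # Pass@2 gives more credit to the same set of samples.
    print(pass_at_k(n=4, c=1, k=1))
    print(pass_at_k(n=4, c=1, k=2))
```

With k = 1 the estimate reduces to the fraction of correct samples, c / n, which is the Pass@1 signal the abstract describes as the typical RLVR reward. Larger k gives credit as soon as any response in the subset succeeds.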
