Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

  • 2025-10-21 12:29:05
  • Marius Dragoi, Ioana Pintilie, Florin Gogianu, Florin Brad

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
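
To make the contrast between the two metrics concrete, the following is a minimal sketch and not the authors' implementation. It assumes each problem is evaluated with n sampled completions, of which c are correct, and that the "proportion of completions" in Cover@tau is estimated by the empirical fraction c/n with a non-strict threshold (c/n >= tau). The abstract does not specify either choice. Pass@k uses the standard unbiased estimator 1 - C(n-c, k)/C(n, k). The function names `pass_at_k` and `cover_at_tau` and the toy counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem: n samples, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def cover_at_tau(correct_counts: list[int], n: int, tau: float) -> float:
    """Fraction of problems whose empirical success rate c/n is at least tau.

    Assumption: the success proportion is estimated as c/n and the
    threshold is non-strict (>=); the abstract leaves both unspecified.
    """
    covered = sum(1 for c in correct_counts if c / n >= tau)
    return covered / len(correct_counts)


# Toy data (made up for illustration): correct completions out of n = 100
# for four problems. The third problem is solved once, e.g. by a lucky guess.
counts = [100, 60, 1, 0]
n = 100

mean_pass_at_100 = sum(pass_at_k(n, c, 100) for c in counts) / len(counts)
cover_half = cover_at_tau(counts, n, tau=0.5)
```

On this toy data, the mean Pass@100 counts the problem solved by a single completion as fully solved, while Cover@tau at tau = 0.5 counts only the two problems that are solved in at least half of the completions.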

 
