Abstract
Process reward models (PRMs) are critical for mathematical reasoning tasks because they assign a reward to each intermediate step. Training a PRM requires process-wise supervision data, which is typically built with chain-of-thought (CoT) or tree-based methods that construct the reasoning steps. However, individual reasoning steps may be redundant or contain nuanced errors that are difficult to detect. We attribute these problems to overlooking granularity division during process data collection. In this paper, we propose a coarse-to-fine framework to tackle this issue. Specifically, while gathering the process supervision data, we collect coarse reasoning steps by merging adjacent steps according to a preset merging granularity, and then sequentially reduce the merging granularity to collect fine-grained reasoning steps. Each newly synthesized step is relabeled with the label of its last constituent step. During training, we also traverse the collected training corpus in a coarse-to-fine manner. We conduct extensive experiments on popular mathematical reasoning datasets across diverse loss criteria, and the proposed framework consistently boosts reasoning performance.
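The merging and relabeling procedure can be illustrated with a minimal sketch. It is not the paper's implementation: the names `merge_steps` and `coarse_to_fine`, the use of non-overlapping windows, the integer labels (1 for a correct step, 0 for an incorrect one), and the decrement of the granularity by one per round are all assumptions made for illustration.

```python
from typing import List, Tuple

# A step is (text, label); the 1 = correct / 0 = incorrect encoding is assumed.
Step = Tuple[str, int]


def merge_steps(steps: List[Step], granularity: int) -> List[Step]:
    """Merge each run of `granularity` adjacent steps into one coarse step.

    Assumption: windows do not overlap, and a trailing window may be shorter.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    merged: List[Step] = []
    for start in range(0, len(steps), granularity):
        window = steps[start:start + granularity]
        text = " ".join(step_text for step_text, _ in window)
        # Relabel the synthesized step with the label of its last step.
        label = window[-1][1]
        merged.append((text, label))
    return merged


def coarse_to_fine(steps: List[Step], max_granularity: int) -> List[Tuple[int, List[Step]]]:
    """Collect merged steps from the coarsest granularity down to 1.

    Assumption: the granularity is reduced by one each round.
    """
    return [(g, merge_steps(steps, g)) for g in range(max_granularity, 0, -1)]


# Hypothetical four-step solution whose third step introduces an error.
example = [("s1", 1), ("s2", 1), ("s3", 0), ("s4", 0)]
corpus = coarse_to_fine(example, max_granularity=2)
```

Under these assumptions, `corpus` lists the granularity-2 data first and the original fine-grained steps last, so iterating over it in order would mirror the coarse-to-fine traversal used during training.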