Think Twice: Branch-and-Rethink Reasoning Reward Model

Abstract

Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.
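
To make the training signal described above concrete, the following is a minimal sketch of a binary outcome reward gated by strict format checks over a two-turn trace. It is not the authors' implementation: the tag names (`<branch>`, `<rethink>`, `<verdict>`), the function name `binary_outcome_reward`, the A/B verdict labels, and the choice of 0.0 as the reward for a malformed trace are all assumptions made for illustration, since the abstract does not specify the trace template.

```python
import re

# Hypothetical trace template: these tag names are illustrative assumptions,
# not the format used in the paper.
TURN1_PATTERN = re.compile(r"<branch>(.+?)</branch>", re.DOTALL)
TURN2_PATTERN = re.compile(
    r"<rethink>(.+?)</rethink>\s*<verdict>([AB])</verdict>", re.DOTALL
)


def binary_outcome_reward(turn1: str, turn2: str, preferred: str) -> float:
    """Return 1.0 only if both turns are well formed and the verdict is correct.

    turn1: text of the branching turn (selected dimensions and hypotheses).
    turn2: text of the rethinking turn, ending in a verdict of "A" or "B".
    preferred: the ground-truth preferred response, "A" or "B".
    """
    # Strict format check on turn 1: the whole turn must match the template.
    if TURN1_PATTERN.fullmatch(turn1.strip()) is None:
        return 0.0
    # Strict format check on turn 2, which also extracts the verdict.
    match = TURN2_PATTERN.fullmatch(turn2.strip())
    if match is None:
        return 0.0
    # Binary outcome: reward only a verdict that agrees with the label.
    return 1.0 if match.group(2) == preferred else 0.0


if __name__ == "__main__":
    t1 = "<branch>factuality, safety</branch>"
    t2 = "<rethink>Response A cites a wrong date.</rethink>\n<verdict>B</verdict>"
    print(binary_outcome_reward(t1, t2, "B"))
```

Under these assumptions, a scalar of this kind could serve as the per-trace reward in a GRPO-style loop, where rewards are compared within a group of sampled traces for the same prompt.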
