PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling

Abstract

Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. While generative reward models (GRMs) offer greater interpretability than traditional scalar RMs, current training paradigms remain limited. Pair-wise methods rely on binary good-versus-bad labels, which cause mismatches for point-wise inference and necessitate complex pairing strategies for effective application in RLHF. On the other hand, point-wise methods require more elaborate absolute labeling with rubric-driven criteria, resulting in poor adaptability and high annotation costs. In this work, we propose the Preference-Aware Task-Adaptive Reward Model (PaTaRM), a unified framework that integrates a preference-aware reward (PAR) mechanism with dynamic rubric adaptation. PaTaRM leverages relative preference information from pairwise data to construct robust point-wise training signals, eliminating the need for explicit point-wise labels. Simultaneously, it employs a task-adaptive rubric system that flexibly generates evaluation criteria for both global task consistency and instance-specific fine-grained reasoning. This design enables efficient, generalizable, and interpretable reward modeling for RLHF. Extensive experiments show that PaTaRM achieves an average relative improvement of 4.7% on RewardBench and RMBench across Qwen3-8B and Qwen3-14B models. Furthermore, PaTaRM boosts downstream RLHF performance, with an average improvement of 13.6% across IFEval and InFoBench benchmarks, confirming its effectiveness and robustness. Our code is available at https://github.com/JaneEyre0530/PaTaRM.
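
The abstract does not give the exact form of the PAR mechanism, so the following is only a minimal sketch of the general idea of deriving point-wise training signals from a pairwise label, not the authors' formulation. It assumes the reward model has produced several point-wise scores for the preferred and the dispreferred response; the function name `preference_aware_rewards`, the `margin` parameter, and the mean-based comparison rule are all hypothetical choices made for illustration.

```python
from statistics import mean


def preference_aware_rewards(chosen_scores, rejected_scores, margin=0.0):
    """Illustrative only: turn one pairwise preference into per-rollout rewards.

    chosen_scores   -- point-wise scores sampled for the preferred response
    rejected_scores -- point-wise scores sampled for the dispreferred response
    margin          -- hypothetical gap a score must clear to earn a reward

    A chosen-side score is rewarded when it exceeds the mean rejected score
    by more than `margin`; a rejected-side score is rewarded when it falls
    below the mean chosen score by more than `margin`. No absolute
    point-wise label is needed, only the pairwise ordering.
    """
    chosen_mean = mean(chosen_scores)
    rejected_mean = mean(rejected_scores)

    chosen_rewards = [
        1.0 if score > rejected_mean + margin else 0.0
        for score in chosen_scores
    ]
    rejected_rewards = [
        1.0 if score < chosen_mean - margin else 0.0
        for score in rejected_scores
    ]
    return chosen_rewards, rejected_rewards


if __name__ == "__main__":
    # Three sampled scores per response (made-up numbers for the demo).
    print(preference_aware_rewards([7, 8, 4], [5, 6, 3]))
```

Under these assumptions, a score is rewarded only when it is consistent with the known preference ordering, which is one way a model trained on pairwise data could still be used for point-wise scoring at inference time.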
