PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

Abstract

Reward models are pivotal for aligning Large Language Models (LLMs) with human preferences. Existing approaches face two key limitations. First, discriminative reward models require large-scale annotated data because, unlike generative reward models, they cannot exploit the preference instruction-following capability of LLMs. Second, reward models are particularly prone to reward overoptimization, where LLMs exploit weaknesses in the reward function instead of improving true alignment. We introduce PIRA, a training paradigm that integrates three complementary strategies to address these challenges: (1) reformulating question-answer pairs into preference-task instructions to explicitly leverage LLMs' preference instruction-following capability, (2) averaging the rewards obtained from diverse preference-task instructions for each sample, which mitigates task-specific bias and enhances robustness across evaluation perspectives, and (3) averaging outputs from the value head under different dropout rates to stabilize reward estimation. Experiments on public datasets show that PIRA improves performance considerably, enhances generalization, and effectively mitigates reward overoptimization.
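The two averaging steps described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the instruction templates, the dropout rates, the toy `encode` function (a stand-in for an LLM's final hidden state), the linear value head, and the nesting order of the two averages are all assumptions made for illustration only.

```python
import random
from statistics import fmean

# Hypothetical preference-task instruction templates (not taken from the paper).
TEMPLATES = [
    "Rate how helpful the answer is.\nQuestion: {q}\nAnswer: {a}",
    "Judge whether a human would prefer this answer.\nQuestion: {q}\nAnswer: {a}",
    "Assess the accuracy and harmlessness of the answer.\nQuestion: {q}\nAnswer: {a}",
]

# Assumed dropout rates; the abstract does not state which rates are used.
DROPOUT_RATES = [0.0, 0.1, 0.2, 0.3]


def encode(prompt, dim):
    """Toy stand-in for an LLM hidden state: deterministic pseudo-features."""
    rng = random.Random(prompt)  # string seeds are deterministic across runs
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def value_head(hidden, weights, rate, rng):
    """Linear value head with inverted dropout applied to its input."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("dropout rate must be in [0, 1)")
    keep = 1.0 - rate
    total = 0.0
    for h, w in zip(hidden, weights):
        if rng.random() >= rate:  # keep this unit with probability `keep`
            total += (h / keep) * w
    return total


def pira_reward(question, answer, weights, seed=0):
    """Average over dropout rates, then over instruction templates."""
    rng = random.Random(seed)
    per_instruction = []
    for template in TEMPLATES:
        prompt = template.format(q=question, a=answer)
        hidden = encode(prompt, dim=len(weights))
        per_rate = [value_head(hidden, weights, r, rng) for r in DROPOUT_RATES]
        per_instruction.append(fmean(per_rate))
    return fmean(per_instruction)


if __name__ == "__main__":
    toy_weights = [0.5] * 8  # hypothetical value-head weights
    print(pira_reward("What is 2 + 2?", "4", toy_weights))
```

In a real setting, `encode` would be replaced by a forward pass of the instruction-tuned model over the reformulated prompt, and `weights` by the trained value head. The sketch only shows how a single scalar reward could be formed from several instructions and several dropout rates.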
