Abstract
Assessing the quality of language model responses to instructions is vital but challenging due to the complexity of human language across different contexts. This complexity often results in ambiguous or inconsistent interpretations, making accurate assessment difficult. To address this issue, we propose a novel Uncertainty-aware Reward Model (URM) that introduces a robust uncertainty estimation for the quality of paired responses based on Bayesian approximation. Trained with preference datasets, our uncertainty-enabled proxy not only scores rewards for responses but also evaluates their inherent uncertainty. Empirical results demonstrate significant benefits of incorporating the proposed proxy into language model training. Our method boosts the instruction-following capability of language models by refining data curation for training and improving policy optimization objectives, thereby surpassing existing methods by a large margin on benchmarks such as Vicuna and MT-bench. These findings highlight that our proposed approach substantially advances language model training and paves a new way of harnessing uncertainty within language models.