R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

Abstract

Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences. It begins by training a reward model on pairwise human feedback. The reward model is then used during reinforcement learning to score each generated sequence as a whole, which guides the optimization of the LLM. However, current approaches have a significant shortcoming: \emph{They allocate a single, sparse, and delayed reward to an entire sequence of output}. This may overlook the individual contribution of each token towards the desired outcome. To overcome this limitation, our paper proposes a novel reward redistribution method called R3HF, which enables a more fine-grained, token-level reward allocation. Specifically, our method treats the reward prediction task of the reward model as a regression problem. The redistributed rewards are then computed by evaluating the specific contribution of each token to the reward model's output. This finer-grained signal improves the model's understanding of language nuances, leading to more precise improvements in its performance. Our method is designed to integrate seamlessly with most current techniques while incurring minimal computational cost. Through comprehensive experiments across diverse datasets and tasks, we have verified the effectiveness and superiority of our approach.
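
To make the idea of token-level redistribution concrete, here is a minimal sketch. The abstract does not give the exact formula, so the sketch assumes one plausible reading: each token is credited with the change in the reward model's prediction when that token is appended to the prefix. The names `redistribute_reward` and `toy_reward_model` are hypothetical, and the toy scorer is only a stand-in for a learned reward model.

```python
from typing import Callable, List, Sequence


def redistribute_reward(
    tokens: Sequence[str],
    reward_model: Callable[[Sequence[str]], float],
) -> List[float]:
    """Credit each token with the change in predicted reward it causes.

    Assumption (not confirmed by the abstract): the contribution of token t
    is reward_model(prefix up to t) minus reward_model(prefix up to t-1).
    """
    rewards: List[float] = []
    previous = reward_model(tokens[:0])  # score of the empty prefix (baseline)
    for t in range(1, len(tokens) + 1):
        current = reward_model(tokens[:t])
        rewards.append(current - previous)
        previous = current
    return rewards


def toy_reward_model(prefix: Sequence[str]) -> float:
    # Hypothetical stand-in for a learned reward model: sums fixed word scores.
    scores = {"thanks": 1.0, "great": 0.5, "awful": -2.0}
    return sum(scores.get(token, 0.0) for token in prefix)


tokens = ["thanks", "that", "was", "great"]
token_rewards = redistribute_reward(tokens, toy_reward_model)
print(token_rewards)
```

Under this assumption the per-token rewards telescope: their sum equals the sequence-level score minus the score of the empty prefix, so the total reward signal is preserved while being spread over the tokens that produced it.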
