Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback

Abstract

Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent estimate of the policy gradient. We propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively off-policy corrects the RM using importance weighting, without requiring new labels or samples. This results in a more accurate RM, which empirically leads to an improved final policy. We validate our approach in experiments with summarization and chatbot datasets and show that it performs significantly better than standard RLHF methods and baselines. Our implementation is available at https://github.com/JohannesAck/OffPolicyCorrectedRewardModeling
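
To make the idea of importance-weighting the RM loss concrete, the following is a minimal sketch and not the implementation from the linked repository. It assumes a Bradley-Terry pairwise loss, self-normalized weights, and a clipping constant; the function names (`importance_weights`, `weighted_bt_loss`) and the `clip` parameter are hypothetical choices made for illustration. Each weight is the ratio between the probability of a response pair under the current policy and under the policy that originally sampled it, so pairs that the current policy is more likely to generate count more when the RM is retrained.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def importance_weights(logp_new, logp_old, clip=10.0):
    # logp_new[i] / logp_old[i]: summed log-probability of the i-th response
    # pair under the current policy and under the sampling policy.
    # Clipping and self-normalization are assumptions of this sketch,
    # used to keep the variance of the weights under control.
    log_clip = math.log(clip)
    raw = [math.exp(min(n - o, log_clip)) for n, o in zip(logp_new, logp_old)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_bt_loss(r_chosen, r_rejected, weights):
    # Importance-weighted Bradley-Terry negative log-likelihood:
    # -sum_i w_i * log sigmoid(r(chosen_i) - r(rejected_i)).
    return -sum(
        w * log_sigmoid(c - r)
        for c, r, w in zip(r_chosen, r_rejected, weights)
    )
```

With uniform weights this reduces to the usual mean pairwise RM loss, so the correction only changes training once the policy has moved away from the distribution that produced the labeled pairs. No new labels or samples are needed, because only the weights on the existing preference data change.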
