Reward Shaping to Mitigate Reward Hacking in RLHF

Abstract

Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigates reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. We evaluate PAR on two base models, Gemma2-2B and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and remains robust against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR. This work was done during Jiayi Fu's internship at StepFun.
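
As an illustration of the two design principles, the sketch below shows one plausible reading of a preference-based reward. It assumes the "latent preference" is the Bradley-Terry probability that a response beats a reference response, i.e. a sigmoid applied to the difference between the two reward-model scores. That assumption, and the names `sigmoid` and `preference_reward`, are hypothetical and not taken from the abstract; the authors' implementation in the linked repository may differ.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_reward(reward: float, reference_rewards: Sequence[float]) -> float:
    """Shaped RL reward: mean preference of a response over reference responses.

    Assumption (not stated in the abstract): the preference is modeled as
    sigmoid(reward - reference_reward), as in a Bradley-Terry reward model.
    The result is bounded, and it changes fastest when the raw reward is
    close to the reference reward, then flattens out.
    """
    if not reference_rewards:
        raise ValueError("at least one reference reward is required")
    total = sum(sigmoid(reward - ref) for ref in reference_rewards)
    return total / len(reference_rewards)
```

With a single reference reward of 0.0, a raw reward of 0.0 maps to 0.5. Raising the raw reward from 0.0 to 1.0 changes the shaped reward far more than raising it from 3.0 to 4.0, which is the "rapid initial growth followed by gradual convergence" behavior, and no raw reward can push the shaped value outside the interval from 0 to 1.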
