WARM: On the Benefits of Weight Averaged Reward Models

Abstract

Aligning large language models (LLMs) with human preferences through reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward model (RM) to achieve seemingly high rewards without meeting the underlying objectives. We identify two primary challenges when designing RMs to mitigate reward hacking: distribution shifts during the RL process and inconsistencies in human preferences. As a solution, we propose Weight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then averaging them in the weight space. This strategy follows the observation that fine-tuned weights remain linearly mode connected when sharing the same pre-training. By averaging weights, WARM improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.
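
The contrast between averaging weights and ensembling predictions can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`average_weights`, `linear_reward`), the dict-of-lists parameter layout, and the toy linear reward head are all assumptions made for illustration. The sketch assumes every RM shares the same parameter names and shapes, as they would when fine-tuned from the same pre-trained checkpoint.

```python
from typing import Dict, List

# Hypothetical parameter layout: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Uniformly average the parameters of several reward models, name by name."""
    if not models:
        raise ValueError("need at least one model")
    keys = models[0].keys()
    for m in models:
        if m.keys() != keys:
            raise ValueError("all models must share the same parameter names")
    averaged: Weights = {}
    for name in keys:
        # Group the i-th entry of this parameter across all models, then average.
        columns = zip(*(m[name] for m in models))
        averaged[name] = [sum(col) / len(models) for col in columns]
    return averaged


def linear_reward(weights: Weights, features: List[float]) -> float:
    """Toy reward head (an assumption): dot(w, x) + b."""
    return sum(w * x for w, x in zip(weights["w"], features)) + weights["b"][0]


# Two toy "fine-tuned" reward models with identical parameter names and shapes.
rm_a: Weights = {"w": [1.0, 2.0], "b": [0.0]}
rm_b: Weights = {"w": [3.0, 4.0], "b": [2.0]}
features = [1.0, 1.0]

# Prediction ensembling: one forward pass per model, then average the rewards.
ensemble_reward = (linear_reward(rm_a, features) + linear_reward(rm_b, features)) / 2

# Weight averaging: merge once, then a single forward pass at inference time.
warm = average_weights([rm_a, rm_b])
warm_reward = linear_reward(warm, features)
```

For this toy linear head the two rewards coincide exactly; for nonlinear networks they generally differ, which is where the linear mode connectivity observation in the abstract becomes relevant. The efficiency point is visible either way: the ensemble needs one forward pass per RM, while the averaged model needs only one.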
