The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. RLHF proceeds in three steps: collecting human preference data, training a reward model on that data, and optimizing a base ML model with respect to that reward for extrinsic evaluation metrics (e.g. MMLU, GSM8k). RLHF relies on many assumptions about how the various pieces fit together, such as a reward model capturing human preferences and an RL optimizer extracting the right signal from a reward model. As the RLHF process involves many distinct design decisions, it is easy to assume that multiple processes are correlated and therefore numerically linked. This apparent correlation often does not hold: reward models are easily overoptimized, and RL optimizers can reduce performance on tasks not modeled in the data. Models trained with imperfect RLHF systems notably tend to refuse basic requests for safety reasons or to appear lazy in their generations. As chat model evaluation becomes increasingly nuanced, the reliance on a perceived link between reward model training, RL scores, and downstream performance drives these issues, which we describe as an objective mismatch. In this paper, we illustrate the causes of this issue, reviewing relevant literature from model-based reinforcement learning, and argue for solutions. Solving objective mismatch in RLHF will allow the ML models of the future to be more precisely aligned to user instructions for both safety and helpfulness.
