Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback

Abstract

Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. This alignment requires a vast corpus of human feedback to learn a reward model, which is subsequently used to finetune language models. However, we have identified that the reward model often finds shortcuts to bypass its intended objectives, misleadingly assuming that humans prefer longer responses. The emergence of length bias often induces the model to favor longer outputs, yet it doesn't equate to an increase in helpful information within these outputs. In this paper, we propose an innovative solution, applying the Product-of-Experts (PoE) technique to separate reward modeling from the influence of sequence length. In our framework, the main expert concentrates on understanding human intents, while the biased expert targets the identification and capture of length bias. To further enhance the learning of bias, we introduce perturbations into the bias-focused expert, disrupting the flow of semantic information. Experimental results validate the effectiveness of our approach, indicating that language model performance is improved, irrespective of sequence length.
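
The abstract does not give the training objective, so the following is only a minimal sketch of how a Product-of-Experts combination of two reward experts could look. It assumes that each expert scores a response with a scalar reward and that preferences follow a Bradley-Terry model; under that assumption, the normalized product of the two experts' preference probabilities equals a sigmoid of the summed reward margins. The function names (`poe_preference_prob`, `poe_pairwise_loss`) are hypothetical, and the perturbation applied to the bias-focused expert is not modeled here.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def poe_preference_prob(main_margin: float, bias_margin: float) -> float:
    """Normalized product of two Bradley-Terry experts (assumed form).

    Each margin is reward(chosen) - reward(rejected) for one expert.
    The result is algebraically equal to sigmoid(main_margin + bias_margin).
    """
    win = sigmoid(main_margin) * sigmoid(bias_margin)
    lose = sigmoid(-main_margin) * sigmoid(-bias_margin)
    return win / (win + lose)


def poe_pairwise_loss(
    main_chosen: float,
    main_rejected: float,
    bias_chosen: float,
    bias_rejected: float,
) -> float:
    """Negative log-likelihood of the human preference under the combined experts."""
    main_margin = main_chosen - main_rejected
    bias_margin = bias_chosen - bias_rejected
    return -math.log(poe_preference_prob(main_margin, bias_margin))


# Hypothetical rewards: the bias expert already explains this preference
# (for example, because the chosen response is longer), so the combined loss
# is small even though the main expert is indifferent.
loss_bias_explains = poe_pairwise_loss(0.0, 0.0, 2.0, 0.0)
loss_no_signal = poe_pairwise_loss(0.0, 0.0, 0.0, 0.0)
```

In this sketch, a preference pair that the bias expert can already explain contributes little gradient to the main expert, which is the usual motivation for PoE-style debiasing. Scoring responses with the main expert alone after training is a common choice in such setups, but it is an assumption here rather than something the abstract states.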
