A Long Way to Go: Investigating Length Correlations in RLHF

Abstract

Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering, summarization, and multi-turn dialogue. When optimizing for helpfulness, RLHF has been consistently observed to drive models to produce longer outputs. This paper demonstrates that optimizing for response length is a significant factor behind RLHF's reported improvements in these settings. First, we study the relationship between reward and length for reward models trained on three open-source preference datasets for helpfulness. Here, length correlates strongly with reward, and improvements in reward score are driven in large part by shifting the distribution over output lengths. We then explore interventions during both RL and reward model learning to see if we can achieve the same downstream improvements as RLHF without increasing length. While our interventions mitigate length increases, they aren't uniformly effective across settings. Furthermore, we find that even running RLHF with a reward based solely on length can reproduce most of the downstream improvements over the initial policy model, showing that reward models in these settings have a long way to go.
