Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

Abstract

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods' effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
