Understanding Impact of Human Feedback via Influence Functions

Abstract

In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. Our experiments showcase two key applications of influence functions: (1) detecting common labeler biases in human feedback datasets and (2) guiding labelers in refining their strategies to better align with expert feedback. By quantifying the impact of human feedback, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback. Source code is available at https://github.com/mintaywon/IF_RLHF
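
To make the idea of measuring the impact of individual feedback concrete, the following is a minimal sketch of the classical influence-function score applied to a toy preference dataset. It is not the compute-efficient approximation proposed in the paper: it assumes a two-parameter linear reward model with a Bradley-Terry (logistic) preference loss, inverts the exact 2x2 Hessian, and uses a hypothetical hand-made dataset in which the last training label is deliberately flipped. All function names, the damping value `lam`, and the data are illustrative assumptions. The sign convention used here is that a positive score means upweighting the example is predicted to increase validation loss.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def example_grad(theta, d, y):
    # Gradient of the logistic preference loss for one pair, where d is the
    # feature difference (response A minus response B) and y = 1 if A is preferred.
    p = sigmoid(theta[0] * d[0] + theta[1] * d[1])
    return [(p - y) * d[0], (p - y) * d[1]]

def mean_grad(theta, data):
    g = [0.0, 0.0]
    for d, y in data:
        gi = example_grad(theta, d, y)
        g[0] += gi[0] / len(data)
        g[1] += gi[1] / len(data)
    return g

def fit(data, lam=0.1, lr=0.5, steps=2000):
    # Plain gradient descent on the L2-regularized mean loss.
    theta = [0.0, 0.0]
    for _ in range(steps):
        g = mean_grad(theta, data)
        theta = [theta[k] - lr * (g[k] + lam * theta[k]) for k in range(2)]
    return theta

def hessian(theta, data, lam=0.1):
    # Exact Hessian of the regularized training loss (lam acts as damping).
    h = [[lam, 0.0], [0.0, lam]]
    for d, _ in data:
        p = sigmoid(theta[0] * d[0] + theta[1] * d[1])
        w = p * (1.0 - p) / len(data)
        for a in range(2):
            for b in range(2):
                h[a][b] += w * d[a] * d[b]
    return h

def solve2(h, g):
    # Solve the 2x2 linear system h @ v = g by Cramer's rule.
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return [(h[1][1] * g[0] - h[0][1] * g[1]) / det,
            (h[0][0] * g[1] - h[1][0] * g[0]) / det]

def influence_scores(train, val, lam=0.1):
    # score_i = -grad_val^T H^{-1} grad_i, evaluated at the fitted parameters.
    theta = fit(train, lam)
    v = solve2(hessian(theta, train, lam), mean_grad(theta, val))
    grads = [example_grad(theta, d, y) for d, y in train]
    return [-(v[0] * g[0] + v[1] * g[1]) for g in grads]

# Hypothetical toy data: the first feature decides the true preference.
train = [((1.0, 0.0), 1), ((2.0, 0.5), 1), ((-1.0, 0.2), 0), ((-2.0, -0.3), 0),
         ((1.5, -0.4), 1), ((-1.5, 0.1), 0),
         ((2.0, 0.0), 0)]  # last label is deliberately flipped
val = [((1.0, 0.1), 1), ((-1.0, -0.1), 0), ((2.0, 0.2), 1), ((-2.0, 0.0), 0)]

scores = influence_scores(train, val)
most_harmful = max(range(len(scores)), key=lambda i: scores[i])
print(most_harmful, scores)
```

In this toy setting the flipped pair is expected to receive the largest score, which is the kind of signal one could use to surface suspicious labels for review. At LLM scale the exact Hessian is not tractable, which is why an approximation such as the one the abstract describes is needed.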
