Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecifications. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset.
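
For readers unfamiliar with the Bradley-Terry model mentioned above, the following is a minimal sketch of its standard textbook form: the probability that one response is preferred over another is the logistic function of the difference in their scalar rewards, and a reward model is typically fit by minimizing the resulting negative log-likelihood. This is not the robust algorithm proposed in the paper; the function names and the plain-float reward inputs are assumptions chosen for illustration.

```python
import math


def bt_preference_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the 'chosen' response is preferred.

    Hypothetical helper: P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    """
    diff = reward_chosen - reward_rejected
    # Numerically stable sigmoid: avoid overflow in exp() for large |diff|.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def bt_negative_log_likelihood(pairs) -> float:
    """Mean negative log-likelihood over (reward_chosen, reward_rejected) pairs.

    Uses -log(sigmoid(d)) = softplus(-d), computed in a stable form.
    """
    total = 0.0
    for r_chosen, r_rejected in pairs:
        x = -(r_chosen - r_rejected)
        # softplus(x) = max(x, 0) + log(1 + exp(-|x|))
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / len(pairs)
```

Under this model, equal rewards give a preference probability of exactly 0.5, and the loss depends only on the reward gap. That is the kind of structural assumption about human preferences that the abstract argues may be misspecified in practice.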
