Robust Reinforcement Learning from Corrupted Human Feedback

Abstract

Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. For various reasons, e.g., personal bias, context ambiguity, lack of training, etc., human annotators may give incorrect or inconsistent preference labels. To tackle this challenge, we propose a robust RLHF approach, $R^3M$, which models the potentially corrupted preference labels as sparse outliers. Accordingly, we formulate the robust reward learning as an $\ell_1$-regularized maximum likelihood estimation problem. Computationally, we develop an efficient alternating optimization algorithm, which only incurs negligible computational overhead compared with the standard RLHF approach. Theoretically, we prove that under proper regularity conditions, $R^3M$ can consistently learn the underlying reward and identify outliers, provided that the number of outlier labels scales sublinearly with the preference sample size. Furthermore, we remark that $R^3M$ is versatile and can be extended to various preference optimization methods, including direct preference optimization (DPO). Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R^3M$ improves the robustness of the reward against several types of perturbations to the preference data.
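
To make the idea of an $\ell_1$-regularized likelihood with alternating updates concrete, the following is a minimal sketch and not the paper's implementation. It assumes a Bradley-Terry preference model with a linear reward $r(x) = w^\top x$, a per-pair offset `delta` that absorbs suspicious labels, and the per-pair loss $-\log \sigma(\text{margin} + \delta) + \lambda |\delta|$ with $0 < \lambda < 1$. The names `update_delta` and `fit_robust_reward`, the linear reward, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -np.logaddexp(0.0, -z)

def update_delta(margin, lam):
    # Minimizer over delta of  -log sigmoid(margin + delta) + lam * |delta|
    # for 0 < lam < 1 (assumed loss form). The result is zero whenever the
    # margin already exceeds the threshold, which is what makes delta sparse.
    threshold = np.log((1.0 - lam) / lam)
    return np.maximum(0.0, threshold - margin)

def fit_robust_reward(x_chosen, x_rejected, lam=0.2, lr=0.5, n_iters=200):
    # Hypothetical linear reward r(x) = w @ x, chosen only to keep the sketch short.
    diff = x_chosen - x_rejected
    w = np.zeros(diff.shape[1])
    delta = np.zeros(diff.shape[0])
    for _ in range(n_iters):
        # Step 1: with w fixed, update the per-pair outlier offsets in closed form.
        margin = diff @ w
        delta = update_delta(margin, lam)
        # Step 2: with delta fixed, take one gradient step on w for the
        # mean negative log-likelihood -log sigmoid(margin + delta).
        p_wrong = np.exp(log_sigmoid(-(margin + delta)))
        grad = -(p_wrong[:, None] * diff).mean(axis=0)
        w -= lr * grad
    return w, delta

# Synthetic usage: 200 pairs in 3 dimensions, with 10% of labels flipped.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
a, b = rng.normal(size=(200, 3)), rng.normal(size=(200, 3))
prefer_a = (a @ w_true) > (b @ w_true)
flip = rng.random(200) < 0.1
prefer_a = prefer_a ^ flip
x_chosen = np.where(prefer_a[:, None], a, b)
x_rejected = np.where(prefer_a[:, None], b, a)
w_hat, delta_hat = fit_robust_reward(x_chosen, x_rejected)
```

Under these assumptions, pairs whose `delta_hat` stays at zero are treated as clean, while a positive entry marks a pair the current reward disagrees with. The closed-form step adds only an elementwise operation per iteration, which illustrates why an alternating scheme of this kind can be cheap relative to the reward update itself.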
