Abstract
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.