Reinforcement Learning from Human Feedback

Abstract

Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.
