Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

  • 2023-09-11 18:25:24
  • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.


