RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

  • 2024-09-03 15:01:54
  • Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.

 
