Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF), a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on par with using human feedback, offering a potential solution to the scalability limitations of RLHF.