Reinforcement Learning Enhanced LLMs: A Survey

Abstract

This paper surveys research in the rapidly growing field of enhancing large language models (LLMs) with reinforcement learning (RL), a technique that enables LLMs to improve their performance by receiving feedback in the form of rewards based on the quality of their outputs, allowing them to generate more accurate, coherent, and contextually appropriate responses. In this work, we systematically review the current state of knowledge on RL-enhanced LLMs, consolidating and analyzing the research in this field to help researchers understand its current challenges and advancements. Specifically, we (1) detail the basics of RL; (2) introduce popular RL-enhanced LLMs; (3) review research on two widely used reward model-based RL techniques: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF); and (4) explore Direct Preference Optimization (DPO), a set of methods that bypass the reward model to directly use human preference data for aligning LLM outputs with human expectations. We also point out current challenges and deficiencies of existing methods and suggest some avenues for further improvement. The project page for this work is available at https://github.com/ShuheWang1998/Reinforcement-Learning-Enhanced-LLMs-A-Survey
