Understanding Reinforcement Learning for Model Training, and future directions with GRAPE

  • 2025-10-21 15:29:40
  • Rohit Patel

Abstract

This paper provides a self-contained, from-scratch exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, and/or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.

 
