Abstract
This paper provides a self-contained, from-scratch exposition of key algorithms for instruction tuning of models: Supervised Fine-Tuning (SFT), Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, omit critical details, or are overly general and complex. Here, each method is developed step by step using simplified, explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts directly to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, we present new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution).