Reinforcement Learning without Human Feedback for Last Mile Fine-Tuning of Large Language Models

Abstract

Reinforcement learning is used to align language models with human preference signals after first pre-training the model to predict the next token of text within a large corpus using likelihood maximization. Before being deployed in a specific domain, models are often further fine-tuned on task-specific data. Since human preferences are often unavailable for this last step, it is performed using likelihood maximization, as that is the typical default method. However, reinforcement learning has other advantages besides facilitating alignment to a human-derived reward function. For one, whereas likelihood maximization is a form of imitation learning in which the model is trained on what to do under ideal conditions, reinforcement learning is not limited to demonstrating actions just for optimally reached states and teaches a model what to do under a range of scenarios as it explores the policy space. It also teaches a model what not to do, suppressing competitive but poor actions. This work develops a framework for last-mile fine-tuning using reinforcement learning and tests whether it garners performance gains. The experiments center on abstractive summarization, but the framework is general and broadly applicable. Use of the procedure produced significantly better results than likelihood maximization when comparing raw predictions. For the specific data tested, the gap could be bridged by employing post-processing of the maximum likelihood outputs. Nonetheless, the framework offers a new avenue for model optimization in situations where post-processing may be less straightforward or effective, and it can be extended to include more complex classes of undesirable outputs to penalize and train against, such as hallucinations.
