Leftover-Lunch: Advantage-based Offline Reinforcement Learning for Language Models

Abstract

Reinforcement Learning with Human Feedback (RLHF) is the most prominent method for Language Model (LM) alignment. However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By treating the entire LM output sequence as a single action, A-LoL allows incorporating sequence-level classifiers or human-designed scoring functions as rewards. Subsequently, by using the LM's value estimate, A-LoL trains only on positive-advantage (leftover) data points, making it resilient to noise. Overall, A-LoL is an easy-to-implement, sample-efficient, and stable LM training recipe. We demonstrate the effectiveness of A-LoL and its variants on a set of four different language generation tasks. We compare against both online RL (PPO) and recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated safer and more helpful than the baselines according to humans. Additionally, in the remaining three tasks, A-LoL could optimize multiple distinct reward functions even when using noisy or suboptimal training data. We also release our experimental code at https://github.com/abaheti95/LoL-RL
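
The positive-advantage filtering described above can be illustrated with a minimal sketch. It assumes that each training example already has a scalar sequence-level reward and a scalar value estimate, and that the advantage is simply their difference. The function name `positive_advantage_subset` and the toy numbers are hypothetical and are not taken from the released code. The sketch covers only the data-selection step, not the full A-LoL training objective.

```python
from typing import List, Tuple


def positive_advantage_subset(
    rewards: List[float], values: List[float]
) -> List[Tuple[int, float]]:
    """Return (index, advantage) pairs for examples whose advantage is positive.

    Assumption: advantage = sequence-level reward - value estimate, with the
    whole output sequence treated as a single action.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    kept = []
    for i, (reward, value) in enumerate(zip(rewards, values)):
        advantage = reward - value
        if advantage > 0:  # zero or negative advantage examples are dropped
            kept.append((i, advantage))
    return kept


# Toy data (hypothetical): four training examples.
rewards = [1.0, 0.25, 0.5, 0.75]
values = [0.5, 0.5, 0.5, 0.25]
kept = positive_advantage_subset(rewards, values)
```

In this toy run, only the first and last examples would remain for training, since the other two have non-positive advantage under the assumed definition.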
