Leftover-Lunch: Advantage-based Offline Reinforcement Learning for Language Models

  • 2024-03-26 19:07:01
  • Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, Mark Riedl
Reinforcement Learning with Human Feedback (RLHF) is the most prominent method for Language Model (LM) alignment. However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By treating the entire LM output sequence as a single action, A-LoL allows incorporating sequence-level classifiers or human-designed scoring functions as rewards. Subsequently, by using the LM's value estimate, A-LoL only trains on positive advantage (leftover) data points, making it resilient to noise. Overall, A-LoL is an easy-to-implement, sample-efficient, and stable LM training recipe. We demonstrate the effectiveness of A-LoL and its variants on a set of four different language generation tasks. We compare against both online RL (PPO) and recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated safer and more helpful than the baselines according to humans. Additionally, in the remaining three tasks, A-LoL could optimize multiple distinct reward functions even when using noisy or suboptimal training data. We also release our experimental code at https://github.com/abaheti95/LoL-RL
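The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Example` container, the function names, and the plain advantage-weighted negative log-likelihood are assumptions made for illustration, and the abstract does not spell out the exact objective, so the released code and the A-LoL variants may differ. The sketch only shows the two ingredients named above: a sequence-level advantage (reward minus the LM's value estimate) and the removal of data points whose advantage is not positive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One offline training instance; the whole output sequence is one action."""
    prompt: str
    output: str
    reward: float  # sequence-level score, e.g. from a classifier (assumed scale)
    value: float   # LM's value estimate for the prompt (hypothetical input)


def advantage(ex: Example) -> float:
    # Sequence-level advantage: how much the output's reward exceeds the estimate.
    return ex.reward - ex.value


def keep_positive_advantage(data: List[Example]) -> List[Example]:
    # Drop every instance whose advantage is not strictly positive.
    return [ex for ex in data if advantage(ex) > 0.0]


def weighted_nll(seq_log_probs: List[float], advantages: List[float]) -> float:
    # Advantage-weighted negative log-likelihood, averaged over the batch.
    # seq_log_probs[i] is the summed token log-probability of output i.
    # Assumption: this simple weighting stands in for the paper's objective.
    if len(seq_log_probs) != len(advantages):
        raise ValueError("seq_log_probs and advantages must have equal length")
    if not advantages:
        return 0.0
    total = sum(-a * lp for a, lp in zip(advantages, seq_log_probs))
    return total / len(advantages)


if __name__ == "__main__":
    data = [
        Example("q1", "helpful answer", reward=0.9, value=0.5),
        Example("q2", "noisy answer", reward=0.2, value=0.6),
    ]
    kept = keep_positive_advantage(data)
    print([ex.output for ex in kept])
```

In this toy run only the first instance survives the filter, since the second one scores below the value estimate. Filtering on the advantage rather than on the raw reward is what lets low-quality or noisy instances drop out of training in this sketch.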


