Parameter Efficient Reinforcement Learning from Human Feedback

  • 2024-09-12 19:25:16
  • Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin, Zhang Chen, Zac Yu, Jarvis Jin, Simral Chaudhary, Roman Komarytsia, Christiane Ahlheim, Yonghao Zhu, Bowen Li, Saravanan Ganesh, Bill Byrne, Jessica Hoffmann, Hassan Mansoor, Wei Li, Abhinav Rastogi, Lucas Dixon

Abstract

While Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language and Vision-Language Models (LLMs and VLMs) with human preferences, its computational cost and complexity hamper its wider adoption. To alleviate some of the computational burden of fine-tuning, parameter-efficient methods such as LoRA were introduced. In this work, we empirically evaluate the setup of Parameter Efficient Reinforcement Learning from Human Feedback (PE-RLHF), which leverages LoRA fine-tuning for both Reward Modeling and Reinforcement Learning. We benchmark the PE-RLHF setup on six diverse datasets spanning summarization, harmless/helpful response generation, UI automation, and visual question answering, in terms of both the effectiveness of the trained models and the training resources required. Our findings show, for the first time, that PE-RLHF achieves performance comparable to RLHF while significantly reducing training time (up to 90% faster for reward models and 30% faster for RL) and memory footprint (up to 50% reduction for reward models and 27% for RL). We provide comprehensive ablations across LoRA ranks and model sizes for both reward modeling and reinforcement learning. By mitigating the computational burden associated with RLHF, we push for a broader adoption of PE-RLHF as an alignment technique for LLMs and VLMs.
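
To give a rough sense of why LoRA reduces the fine-tuning burden, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning versus a rank-r LoRA adapter (a frozen weight W plus a trainable low-rank update BA). The function name, the layer dimensions, and the rank are illustrative assumptions, not values taken from the paper, and the count does not by itself predict the training-time or memory figures reported above.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> dict:
    """Compare trainable parameters for one d_out x d_in weight matrix.

    Full fine-tuning trains every entry of W (d_out * d_in values).
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in),
    so the adapted weight is W + B @ A.
    """
    if rank <= 0 or rank > min(d_in, d_out):
        raise ValueError("rank must be in [1, min(d_in, d_out)]")
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return {"full": full, "lora": lora, "fraction": lora / full}


# Hypothetical layer size and rank, chosen only for illustration.
counts = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(counts)
```

With these assumed sizes the adapter trains 65,536 parameters against 16,777,216 for the full matrix, about 0.4% of the total. Raising the rank increases the trainable count linearly, which is the axis the paper's rank ablations vary.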

 
