Curiosity-Driven Reinforcement Learning from Human Feedback

  • 2025-01-20 12:51:40
  • Haoran Sun, Yekun Chai, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at https://github.com/ernie-research/CD-RLHF.
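
As a rough illustration of how dense intrinsic rewards could sit alongside a sparse extrinsic reward, here is a minimal sketch. It is not taken from the CD-RLHF code: the function name `combine_rewards`, the weight `eta`, and the choice to place the extrinsic reward on the final token are all assumptions made for illustration, and how the intrinsic values are computed is left open.

```python
from typing import List


def combine_rewards(intrinsic: List[float], extrinsic: float, eta: float = 0.1) -> List[float]:
    """Return one reward per generated token.

    intrinsic: hypothetical per-token novelty bonuses (how they are
               computed is outside the scope of this sketch).
    extrinsic: a single scalar score for the whole response, e.g. from
               a reward model (assumed to apply only at the last token).
    eta:       hypothetical weight on the intrinsic term.
    """
    if not intrinsic:
        raise ValueError("expected at least one token")
    # Dense part: every token receives its scaled intrinsic bonus.
    rewards = [eta * r for r in intrinsic]
    # Sparse part: the extrinsic score is added only at the final token.
    rewards[-1] += extrinsic
    return rewards
```

Under these assumptions, a larger `eta` shifts the training signal toward novelty and `eta = 0` leaves only the sparse extrinsic reward on the last token.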

 
