Curiosity-Driven Reinforcement Learning from Human Feedback

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at https://github.com/ernie-research/CD-RLHF.
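
As a rough illustration of the idea of pairing intrinsic rewards for novel states with a sparse extrinsic reward, the following minimal sketch may help. It is not taken from the CD-RLHF code: the count-based novelty bonus is a simple stand-in for whatever curiosity signal the framework actually uses, and the names `novelty_bonuses`, `combine_rewards`, and the weight `eta` are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, List, Sequence


def novelty_bonuses(states: Sequence[Hashable], counts: Counter) -> List[float]:
    """Count-based novelty (an assumed stand-in for a curiosity signal).

    Each visit to a state increments its count; the bonus is 1/sqrt(count),
    so rarely seen states earn a larger intrinsic reward.
    """
    bonuses = []
    for state in states:
        counts[state] += 1
        bonuses.append(1.0 / sqrt(counts[state]))
    return bonuses


def combine_rewards(
    extrinsic_final: float, intrinsic: Sequence[float], eta: float = 0.1
) -> List[float]:
    """Build per-step rewards for one generated sequence.

    Every step receives eta * intrinsic bonus; the sparse extrinsic score
    (e.g. a reward-model score, assumed here) is added only at the last step.
    """
    if not intrinsic:
        return []
    rewards = [eta * bonus for bonus in intrinsic]
    rewards[-1] += extrinsic_final
    return rewards


# Usage: three generation steps, the third revisiting the first state.
visit_counts = Counter()
bonuses = novelty_bonuses(["a", "b", "a"], visit_counts)
per_step_rewards = combine_rewards(1.0, bonuses, eta=0.5)
```

In this sketch the repeated state receives a smaller bonus than the two first-time states, while the extrinsic score still dominates the final step; the weight `eta` controls how strongly novelty is traded off against the alignment signal.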
