Fine-Tuning Language Models from Human Preferences

Abstract

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
