Abstract
As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about: summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.
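
As a rough illustration of the step "train a model to predict the human-preferred summary", the sketch below shows one common way to turn a pairwise comparison into a training loss. It is an assumption for illustration only: the abstract does not state the loss used, and the function name `pairwise_preference_loss` and its scalar reward inputs are hypothetical.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_other: float) -> float:
    """Negative log-likelihood that the human-preferred summary wins.

    Assumes (hypothetically) that the reward model assigns a scalar score to
    each summary and that the probability of preferring one summary over the
    other is sigmoid(reward_preferred - reward_other).
    """
    margin = reward_preferred - reward_other
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); the two branches
    # keep the argument of exp() non-positive to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Example: the loss is small when the preferred summary scores higher,
# and large when the reward model ranks the pair the wrong way round.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)
```

Under this assumption, minimizing the loss over many human comparisons pushes the reward model to score preferred summaries higher, and the resulting scalar score is what a reinforcement learning step could then use as its reward signal.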