Abstract
Learning from human preferences is important for language models to be helpful and useful for humans, and to align with human and social values. Prior work has achieved remarkable success by learning from human feedback to understand and follow instructions. Nonetheless, these methods are either founded on hand-picked model generations that are favored by human annotators, rendering them inefficient in terms of data utilization and challenging to apply in general, or they depend on reward functions and reinforcement learning, which are prone to imperfect reward functions and extremely challenging to optimize. In this work, we propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. Our idea is inspired by how humans learn from extensive feedback presented in the form of language. We convert all types of feedback into sentences, which are then used to fine-tune the model, allowing us to take advantage of the language comprehension capabilities of language models. We condition the model on a sequence of model generations paired with feedback. By doing so, models are trained to generate outputs based on feedback, and they can learn to identify and correct negative attributes or errors. Applying our method to large language models, we observed that Chain of Hindsight significantly surpasses previous methods in aligning language models with human preferences. We observed significant improvements on summarization and dialogue tasks, and our approach is markedly preferred in human evaluations.
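
To make the idea of converting feedback into sentences and conditioning on a sequence of generations paired with feedback more concrete, the following is a minimal sketch. The function name `build_hindsight_sequence`, the feedback phrases, and the text template are illustrative assumptions, not the format used in this work.

```python
from typing import List, Tuple


def build_hindsight_sequence(prompt: str, rated_outputs: List[Tuple[str, str]]) -> str:
    """Join a prompt with (feedback sentence, model output) pairs into one string.

    Hypothetical template: each output is prefixed by its feedback expressed in
    natural language, so positive and negative examples share one sequence.
    """
    parts = [prompt.strip()]
    for feedback, output in rated_outputs:
        parts.append(f"{feedback.strip()}: {output.strip()}")
    return "\n".join(parts)


# Illustrative data only; not taken from any dataset.
example = build_hindsight_sequence(
    "Summarize: The cat sat on the mat all day.",
    [
        ("A bad summary", "Cats exist."),
        ("A good summary", "A cat stayed on the mat all day."),
    ],
)
print(example)
```

A string built this way could serve as one fine-tuning example, with the model trained to predict the outputs given the preceding feedback sentences.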