Abstract
Pretrained language models often do not perform tasks in ways that are in line with our preferences, e.g., generating offensive text or factually incorrect summaries. Recent work approaches this issue by learning from a simple form of human evaluation: comparisons between pairs of model-generated task outputs. Comparison feedback conveys limited information about human preferences per human evaluation. Here, we propose to learn from natural language feedback, which conveys more information per human evaluation. We learn from language feedback on model outputs using a three-step learning algorithm. First, we condition the language model on the initial output and feedback to generate many refinements. Second, we choose the refinement with the highest similarity to the feedback. Third, we finetune a language model to maximize the likelihood of the chosen refinement given the input. In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements, finding that only large language models (175B parameters) do so. Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization ability.
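
The three-step procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure, so a toy bag-of-words cosine similarity stands in for it, and the prompt layout and the names `bow_cosine`, `select_refinement`, `build_finetuning_pair`, and `generate` are hypothetical. The language model is represented by a caller-supplied `generate` function, and the finetuning step itself is left out; the sketch only produces the (input, chosen refinement) pair that finetuning would consume.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def bow_cosine(a: str, b: str) -> float:
    """Toy similarity (an assumption): cosine between bag-of-words counts."""
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_refinement(
    feedback: str,
    refinements: List[str],
    similarity: Callable[[str, str], float] = bow_cosine,
) -> str:
    """Step 2: pick the refinement with the highest similarity to the feedback."""
    if not refinements:
        raise ValueError("need at least one candidate refinement")
    return max(refinements, key=lambda r: similarity(r, feedback))


def build_finetuning_pair(
    task_input: str,
    initial_output: str,
    feedback: str,
    generate: Callable[[str], str],  # hypothetical stand-in for sampling from the LM
    n: int = 4,
) -> Tuple[str, str]:
    """Steps 1 and 2: sample n refinements, keep the one closest to the feedback.

    The returned (input, refinement) pair is what step 3 would finetune on.
    """
    # Step 1: condition on the initial output and the feedback (prompt layout is assumed).
    prompt = (
        f"Input: {task_input}\n"
        f"Initial output: {initial_output}\n"
        f"Feedback: {feedback}\n"
        f"Refinement:"
    )
    candidates = [generate(prompt) for _ in range(n)]
    # Step 2: choose the candidate most similar to the feedback.
    chosen = select_refinement(feedback, candidates)
    # Step 3 (not shown): finetune to maximize the likelihood of `chosen` given `task_input`.
    return task_input, chosen
```

In practice `generate` would wrap a call that samples from the language model, and the similarity function could be swapped for any scoring of how well a refinement reflects the feedback.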