Abstract
Learning from human feedback has been shown to be effective at aligning language models with human preferences. Past work has often relied on Reinforcement Learning from Human Feedback (RLHF), which optimizes the language model using reward scores assigned from a reward model trained on human preference data. In this work we show how the recently introduced Sequence Likelihood Calibration (SLiC) can also be used to effectively learn from human preferences (SLiC-HF). Furthermore, we demonstrate this can be done with human feedback data collected for a different model, similar to off-policy, offline RL data. Automatic and human evaluation experiments on the TL;DR summarization task show that SLiC-HF significantly improves over supervised fine-tuning baselines. Furthermore, SLiC-HF presents a competitive alternative to the PPO RLHF implementation used in past work while being much simpler to implement, easier to tune, and more computationally efficient in practice.