D2PO: Discriminator-Guided DPO with Response Evaluation Models

Abstract

Varied approaches for aligning language models have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages of using a discriminator, like a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training. We explore this approach across a set of diverse tasks, including a realistic chat setting, and find that our approach leads to higher-quality outputs compared to DPO with the same data budget, and greater efficiency in terms of preference data requirements. Furthermore, we show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
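
As a rough illustration of the two ingredients named above, the following minimal Python sketch shows silver labeling with a discriminator score and the standard DPO loss on a single preference pair. It is not the paper's implementation: the function names `silver_label` and `dpo_loss`, the `score` callable standing in for the response evaluation model, the pair format, and the value of `beta` are all assumptions made for illustration.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy (pol_*) and the frozen reference model (ref_*)."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def silver_label(pairs, score):
    """Turn unlabeled response pairs into (chosen, rejected) pairs.

    `score` is a hypothetical stand-in for the discriminator: any callable
    that maps a response to a scalar, where higher means better.
    """
    return [(a, b) if score(a) >= score(b) else (b, a) for a, b in pairs]


# Toy usage: response length plays the role of the discriminator score.
unlabeled = [("short", "a much longer response")]
silver_pairs = silver_label(unlabeled, score=len)
```

In an online loop of the kind the abstract describes, the gold-labeled pairs would also be used to update the discriminator before it silver-labels freshly sampled policy outputs; that update step is omitted here because the abstract does not specify it.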
