Sample Efficient Preference Alignment in LLMs via Active Exploration

Abstract

Preference-based feedback is important for many applications in machine learning where evaluation of a reward function is not feasible. Notable recent examples arise in preference alignment for large language models, including in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). For many applications of preference alignment, the cost of acquiring human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy, and formalize the setting as an active contextual dueling bandit problem. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a polynomial worst-case regret bound. We extend the setting and methodology for practical use in preference alignment of large language models. We provide two extensions, an online and an offline approach. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets including two new datasets that we contribute to the literature.
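
To make the active contextual dueling bandit setting concrete, the following is a minimal toy sketch, not the algorithm proposed in the paper. It assumes a finite set of contexts and actions, a hidden reward table (`true_reward`, hypothetical), and preferences drawn from a Bradley-Terry model. The selection rule (an uncertainty bonus minus the distance of the estimated win rate from 0.5) is a generic heuristic chosen purely for illustration, as are all function and variable names.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_active_dueling(true_reward, n_queries, seed=0):
    """Toy active dueling loop (illustrative assumption, not the paper's method).

    true_reward[c][a] is a hidden reward used only to simulate feedback.
    Returns the learned policy (one action per context) and the win counts.
    """
    rng = random.Random(seed)
    n_ctx, n_act = len(true_reward), len(true_reward[0])
    # wins[c][a][b] = number of times action a was preferred over b in context c
    wins = [[[0] * n_act for _ in range(n_act)] for _ in range(n_ctx)]
    # Every (context, action pair) the learner is allowed to query.
    pairs = [(c, a, b)
             for c in range(n_ctx)
             for a in range(n_act)
             for b in range(a + 1, n_act)]

    def score(c, a, b, t):
        # High when the duel is rarely sampled or its outcome is still unclear.
        n = wins[c][a][b] + wins[c][b][a]
        p_hat = wins[c][a][b] / n if n else 0.5
        bonus = math.sqrt(math.log(t + 1) / (n + 1))
        return bonus - abs(p_hat - 0.5)

    for t in range(1, n_queries + 1):
        # Active step: the learner picks the context and the duel itself.
        c, a, b = max(pairs, key=lambda q: score(*q, t))
        # Simulated preference feedback from a Bradley-Terry model.
        if rng.random() < sigmoid(true_reward[c][a] - true_reward[c][b]):
            wins[c][a][b] += 1
        else:
            wins[c][b][a] += 1

    # Final policy: the action with the most duel wins in each context.
    policy = [max(range(n_act), key=lambda a: sum(wins[c][a]))
              for c in range(n_ctx)]
    return policy, wins


if __name__ == "__main__":
    policy, _ = run_active_dueling([[0.0, 8.0], [8.0, 0.0]], n_queries=200)
    print(policy)
```

The point of the sketch is only the shape of the loop: the learner chooses where to ask for a comparison, observes a binary preference rather than a reward, and spends fewer queries on duels whose outcome already looks settled.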
