Abstract
Aligning large language models (LLMs) with human preferences is critical to recent advances in generative artificial intelligence. Reinforcement learning from human feedback (RLHF) is widely applied to achieve this objective. A key step in RLHF is to learn the reward function from human feedback. However, human feedback is costly and time-consuming, making it essential to collect high-quality conversation data for human teachers to label. Additionally, different human teachers have different levels of expertise. It is thus critical to query the most appropriate teacher for their opinions. In this paper, we use offline reinforcement learning (RL) to formulate the alignment problem. Motivated by the idea of $D$-optimal design, we first propose a dual active reward learning algorithm for the simultaneous selection of conversations and teachers. Next, we apply pessimistic RL to solve the alignment problem, based on the learned reward estimator. Theoretically, we show that the reward estimator obtained through our proposed adaptive selection strategy achieves minimal generalized variance asymptotically, and prove that the sub-optimality of our pessimistic policy scales as $O(1/\sqrt{T})$ with a given sample budget $T$. Through simulations and experiments on LLMs, we demonstrate the effectiveness of our algorithm and its superiority over state-of-the-art methods.
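
For orientation, the generic $D$-optimal criterion behind the selection step can be sketched as follows. The notation is assumed for illustration and is not taken from the paper: $x_t$ denotes the conversation queried at step $t$, $k_t$ the teacher asked to label it, $\theta$ the reward parameter, and $I(\theta; x_t, k_t)$ the Fisher information contributed by that query.

$$
\{(x_t, k_t)\}_{t=1}^{T} \in \arg\max_{(x_1, k_1), \dots, (x_T, k_T)} \; \det\!\left( \sum_{t=1}^{T} I(\theta; x_t, k_t) \right)
$$

For a maximum likelihood estimator under standard regularity conditions, the asymptotic covariance is the inverse of the accumulated information matrix, so maximizing this determinant is equivalent to minimizing the generalized variance, that is, the determinant of the estimator's covariance matrix.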