Abstract
We study Reinforcement Learning from Human Feedback (RLHF), where multiple individuals with diverse preferences provide feedback strategically to sway the final policy in their favor. We show that existing RLHF methods are not strategyproof, which can result in learning a substantially misaligned policy even when only one out of $k$ individuals reports their preferences strategically. In turn, we also find that any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, highlighting an inherent trade-off between incentive alignment and policy alignment. We then propose a pessimistic median algorithm that, under appropriate coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of individuals and samples increases.