Abstract
Reinforcement learning from human feedback (RLHF) has become a cornerstone for aligning large language models with human preferences. However, the heterogeneity of human feedback, driven by diverse individual contexts and preferences, poses significant challenges for reward learning. To address this, we propose a Low-rank Contextual RLHF (LoCo-RLHF) framework that integrates contextual information to better model heterogeneous feedback while maintaining computational efficiency. Our approach builds on a contextual preference model, leveraging the intrinsic low-rank structure of the interaction between user contexts and query-answer pairs to mitigate the high dimensionality of feature representations. Furthermore, we address the challenge of distributional shifts in feedback through our Pessimism in Reduced Subspace (PRS) policy, inspired by pessimistic offline reinforcement learning techniques. We theoretically demonstrate that our policy achieves a tighter sub-optimality gap compared to existing methods. Extensive experiments validate the effectiveness of LoCo-RLHF, showcasing its superior performance in personalized RLHF settings and its robustness to distribution shifts.