Abstract
Reward models trained with conventional Reinforcement Learning from AI Feedback (RLAIF) methods suffer from limited generalizability, which hinders the alignment performance of the policy model during reinforcement learning (RL). This challenge stems from various issues, including distribution shift, preference label noise, and mismatches between overly challenging samples and model capacity. In this paper, we attempt to enhance the generalizability of reward models through a data-centric approach, driven by the insight that these issues are inherently intertwined from the perspective of data difficulty. To address this, we propose a novel framework, $\textit{Curriculum-RLAIF}$, which constructs preference pairs with varying difficulty levels and produces a curriculum that progressively incorporates preference pairs of increasing difficulty for reward model training. Our experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability, increasing the alignment performance of the policy model by a large margin over various non-curriculum baselines without incurring additional inference costs. Detailed analysis and comparisons with alternative approaches, including data selection via external pretrained reward models or internal self-selection mechanisms, as well as other curriculum strategies, further demonstrate the superiority of our approach in terms of simplicity, efficiency, and effectiveness.