CREAM: Consistency Regularized Self-Rewarding Language Models

Abstract

Recent self-rewarding large language models (LLMs) have successfully applied LLM-as-a-Judge to iteratively improve alignment performance without the need for human annotations of preference data. These methods commonly use the same LLM to act as both the policy model (which generates responses) and the reward model (which scores and ranks those responses). The ranked responses are then used as preference pairs to train the LLM via direct alignment techniques (e.g., DPO). However, throughout this process there is no guarantee that the rewarding and ranking are accurate, even though this accuracy is critical for obtaining high-quality preference data. Empirical results from relatively small LLMs (e.g., 7B parameters) also indicate that improvements from self-rewarding may diminish after several iterations in certain situations, which we hypothesize is due to accumulated bias in the reward system. This bias can lead to unreliable preference data for training the LLM. To address this issue, we first formulate and analyze a generalized iterative preference fine-tuning framework for self-rewarding language models. We then introduce regularization into this generalized framework to mitigate overconfident preference labeling in the self-rewarding process. Based on this theoretical insight, we propose the Consistency Regularized sElf-rewarding lAnguage Model (CREAM), which leverages rewarding consistency across different iterations to regularize self-rewarding training, helping the model learn from more reliable preference data. With this explicit regularization, our empirical results demonstrate the superiority of CREAM in improving both reward consistency and alignment performance. The code is publicly available at https://github.com/Raibows/CREAM.
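
As a rough illustration of the two ideas described above (turning self-assigned scores into a preference pair, and checking how consistently two iterations of the model rank the same responses), the following minimal Python sketch may help. It is not taken from the CREAM repository. The function names are hypothetical, and the use of Kendall's tau rescaled to [0, 1] as the consistency measure is an assumption made for this sketch; the abstract does not specify how consistency is computed.

```python
from itertools import combinations


def preference_pair(responses, scores):
    """Return (chosen, rejected): the highest- and lowest-scored responses."""
    ranked = sorted(zip(scores, responses), key=lambda p: p[0], reverse=True)
    return ranked[0][1], ranked[-1][1]


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same responses."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two equally long score lists with >= 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def consistency_rate(scores_current, scores_previous):
    """Rescale tau from [-1, 1] to [0, 1] (assumed measure, see lead-in)."""
    return (kendall_tau(scores_current, scores_previous) + 1) / 2


# Hypothetical scores for four sampled responses to one prompt.
responses = ["a", "b", "c", "d"]
current = [0.9, 0.4, 0.7, 0.1]    # scores from the current iteration's model
previous = [0.8, 0.5, 0.3, 0.2]   # scores from the previous iteration's model

chosen, rejected = preference_pair(responses, current)
rate = consistency_rate(current, previous)
```

Under these assumptions, a rate near 1 means the two iterations agree on the ranking, so the resulting preference pair could be trusted more; a rate near 0 signals disagreement, where a regularized method would down-weight or soften the preference label.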
