T-REG: Preference Optimization with Token-Level Reward Regularization

Abstract

Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models (LLMs) with human values. Traditionally, RLHF involves generating responses to a query and using a reward model to assign a reward to the entire response. However, this approach faces challenges due to its reliance on a single, sparse reward, which makes it difficult for the model to identify which parts of the sequence contribute most significantly to the final reward. Recent methods have attempted to address this limitation by introducing token-level rewards. However, these methods often rely on either a trained credit assignment model or AI annotators, raising concerns about the quality and reliability of the rewards. In this paper, we propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization. Harnessing the self-refinement capabilities of LLMs, our method uses contrastive prompting to enable LLMs to self-generate token-level rewards. These self-generated rewards then act as reward regularization, guiding the model to more effectively distribute sequence-level rewards across tokens. This facilitates better token-level credit assignment and enhances alignment performance. Experiments on instruction-following benchmarks, including Alpaca Eval 2 and Arena-Hard, show that our method consistently outperforms baseline methods by up to 3.8% and 4.4%, respectively. We will release the code and models at https://github.com/wzhouad/T-REG.
