Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation

Abstract

The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals to generated responses. However, mainstream preference modeling in RMs is inadequate in terms of token-level interaction, making its judgment signals vulnerable to being hacked by misallocated attention to context. This stems from two fundamental limitations: (1) Current preference modeling employs decoder-only architectures, where the unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence. (2) The independent Siamese-encoding paradigm results in the absence of token-level inter-sequence attention between chosen and rejected sequences. To address this "attention hacking", we propose "Interaction Distillation", a novel training framework for more adequate preference modeling through attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher to provide sophisticated token interaction patterns via comprehensive attention, and guides the preference model to simulate the teacher model's interaction pattern through an attentional alignment objective. In extensive experiments, interaction distillation provides more stable and generalizable reward signals than state-of-the-art RM optimization methods that target data noise, highlighting that attention hacking constitutes a more fundamental limitation in RMs.
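As a rough illustration of how a preference objective and an attentional alignment objective might be combined, the sketch below adds a pairwise preference loss to a distance between a student attention map and a teacher attention map. The abstract does not specify the loss forms, so everything here is an assumption: the Bradley-Terry form of the preference loss, the mean squared error as the alignment distance, the weight `alpha`, and all function names are hypothetical and are not taken from the paper.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss -log(sigmoid(r_chosen - r_rejected)).

    Assumption: the common Bradley-Terry form; the abstract does not
    state which preference loss is used.
    """
    margin = r_chosen - r_rejected
    # Numerically stable log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def attention_alignment_loss(student_attn, teacher_attn):
    """Mean squared error between two attention maps of the same shape.

    Assumption: MSE is a placeholder distance chosen for illustration.
    """
    total, count = 0.0, 0
    for s_row, t_row in zip(student_attn, teacher_attn):
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            count += 1
    return total / count


def combined_loss(r_chosen, r_rejected, student_attn, teacher_attn, alpha=1.0):
    """Preference loss plus alpha-weighted alignment term (hypothetical weighting)."""
    preference = bradley_terry_loss(r_chosen, r_rejected)
    alignment = attention_alignment_loss(student_attn, teacher_attn)
    return preference + alpha * alignment


# Toy 2-token example: a causal (lower-triangular) student map versus a
# teacher map that attends in both directions.
student = [[1.0, 0.0], [0.5, 0.5]]
teacher = [[0.5, 0.5], [0.5, 0.5]]
loss = combined_loss(1.2, 0.3, student, teacher, alpha=0.5)
```

In this toy setup the alignment term is nonzero only where the causal mask prevents the student from attending to a later token, which is the kind of gap the abstract attributes to unidirectional attention.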
