A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift

Abstract

Foundation models, specifically Large Language Models (LLMs), have lately gained widespread attention and adoption. Reinforcement Learning with Human Feedback (RLHF) involves training a reward model to capture desired behaviors, which is then used to align an LLM. These reward models are additionally used at inference time to estimate how well LLM responses adhere to those desired behaviors. However, there is little work measuring how robust these reward models are to distribution shifts. In this work, we evaluate how reward model performance, measured via accuracy and calibration (i.e., the alignment between accuracy and confidence), is affected by distribution shift. We show novel calibration patterns and accuracy drops due to out-of-distribution (OOD) prompts and responses, and that the reward model is more sensitive to shifts in responses than in prompts. Additionally, we adapt an OOD detection technique commonly used in classification to the reward model setting in order to detect these distribution shifts in prompts and responses.
