Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation

  • 2025-08-21 15:48:03
  • Tuhina Tripathi, Manya Wadhwa, Greg Durrett, Scott Niekum

Abstract

Large Language Models (LLMs) are widely used as proxies for human labelers in both training (Reinforcement Learning from AI Feedback) and large-scale response evaluation (LLM-as-a-judge). Alignment and evaluation are critical components in the development of reliable LLMs, and the choice of feedback protocol plays a central role in both but remains understudied. In this work, we show that the choice of feedback protocol for evaluation (absolute scores versus relative preferences) can significantly affect evaluation reliability and induce systematic biases. In the context of LLM-as-a-judge evaluation, we show that pairwise protocols are more vulnerable to distracted evaluation. Generator models can exploit spurious attributes (or distractor features) favored by the LLM judge, resulting in inflated scores for lower-quality outputs. We find that absolute scoring is more robust to such manipulation, producing judgments that better reflect response quality and are less influenced by distractor features. Our results demonstrate that generator models can flip preferences by embedding distractor features, skewing LLM-as-a-judge comparisons and leading to inaccurate conclusions about model quality in benchmark evaluations. Pairwise preferences flip in about 35% of the cases, compared to only 9% for absolute scores. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
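
To make the flip-rate comparison concrete, the following is a minimal sketch of how such a rate could be computed under each protocol. It assumes that a "flip" means the judge's preferred response for a pair changes once distractor features are added to one response; the abstract does not spell out the exact definition, so this is an illustrative assumption rather than the authors' procedure. All function names and the toy judgments are hypothetical.

```python
from typing import List, Sequence, Tuple


def flip_rate(before: Sequence[str], after: Sequence[str]) -> float:
    """Fraction of pairs whose preferred label differs between two runs.

    Assumption: a 'flip' is any change in the preferred label for a pair.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    if not before:
        raise ValueError("at least one judgment is required")
    flips = sum(1 for b, a in zip(before, after) if b != a)
    return flips / len(before)


def preference_from_scores(score_a: float, score_b: float) -> str:
    """Turn two absolute (pointwise) scores into an implied preference."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def pointwise_flip_rate(
    scores_before: Sequence[Tuple[float, float]],
    scores_after: Sequence[Tuple[float, float]],
) -> float:
    """Flip rate for absolute scoring, via the preferences the scores imply."""
    before: List[str] = [preference_from_scores(a, b) for a, b in scores_before]
    after: List[str] = [preference_from_scores(a, b) for a, b in scores_after]
    return flip_rate(before, after)


# Hypothetical toy judgments (not data from the paper).
# Pairwise protocol: the judge directly names the preferred response.
pairwise_before = ["A", "A", "B", "A"]
pairwise_after = ["A", "B", "B", "B"]  # after distractors are embedded

# Pointwise protocol: the judge scores each response independently.
pointwise_before = [(4, 2), (5, 3), (2, 4), (4, 3)]
pointwise_after = [(4, 2), (5, 4), (2, 4), (3, 4)]

pairwise_rate = flip_rate(pairwise_before, pairwise_after)
pointwise_rate = pointwise_flip_rate(pointwise_before, pointwise_after)
print(f"pairwise flip rate:  {pairwise_rate:.2f}")
print(f"pointwise flip rate: {pointwise_rate:.2f}")
```

In this sketch, pointwise scores are reduced to an implied preference so that both protocols are measured on the same scale; treating a tie as its own label is a design choice made here for illustration and may differ from how the paper handles equal scores.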

 
