Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges

Abstract

As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks--where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical--remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.
