Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution

Abstract

Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of well-calibrated confidence, which is vital for adaptive and reliable evaluation pipelines. In this work, we advocate a shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems. We systematically identify the Overconfidence Phenomenon in current LLM-as-a-Judge systems, where predicted confidence significantly overstates actual correctness, undermining reliability in practical deployment. To quantify this phenomenon, we introduce TH-Score, a novel metric measuring confidence-accuracy alignment. Furthermore, we propose LLM-as-a-Fuser, an ensemble framework that transforms LLMs into reliable, risk-aware evaluators. Extensive experiments demonstrate that our approach substantially improves calibration and enables adaptive, confidence-driven evaluation pipelines, achieving superior reliability and accuracy compared to existing baselines.
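
To make the notion of overconfidence concrete, the sketch below computes two standard calibration quantities for a set of judge verdicts: the gap between mean stated confidence and empirical accuracy, and the Expected Calibration Error (ECE). This is not TH-Score, whose definition is not given in the abstract. The function names, the equal-width binning, and the toy data are assumptions made for illustration only.

```python
from typing import Sequence


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean stated confidence minus empirical accuracy.

    A positive value means the judge claims more confidence than its
    accuracy supports (overconfidence); a negative value means the opposite.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return mean_confidence - accuracy


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Standard ECE with equal-width confidence bins over [0, 1]."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        bin_confidence = sum(conf for conf, _ in members) / len(members)
        bin_accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence-accuracy gap by its share of samples.
        ece += (len(members) / total) * abs(bin_confidence - bin_accuracy)
    return ece


# Toy data (hypothetical): a judge that reports 0.9 confidence on four
# verdicts but is right on only two of them.
toy_confidences = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, True, False, False]
gap = overconfidence_gap(toy_confidences, toy_correct)
ece = expected_calibration_error(toy_confidences, toy_correct)
```

On this toy data the judge's stated confidence exceeds its accuracy, so the gap is positive. A well-calibrated judge would drive both quantities toward zero.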
