Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Abstract

LLM-as-a-Judge has been widely used as an evaluation method in various benchmarks and as a source of supervised rewards in model training. However, despite its strong performance in many domains, its potential issues remain under-explored, which undermines its reliability and limits the scope of its utility. We therefore identify 12 key potential biases and propose CALM, a new automated bias quantification framework that systematically quantifies and analyzes each type of bias in LLM-as-a-Judge using automated, principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models achieve commendable overall performance, significant biases persist in certain specific tasks. These empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. We also discuss the explicit and implicit influence of these biases and offer suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
