JudgeBench: A Benchmark for Evaluating LLM-based Judges

Abstract

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.
