JuStRank: Benchmarking LLM Judges for System Ranking

Abstract

Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
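
The system-level setup described above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: the aggregation is taken to be a plain mean, the agreement measure is taken to be Kendall's tau (tau-a), and the system names and scores are hypothetical toy data. The study's actual aggregation methods and correlation measure may differ.

```python
from itertools import combinations
from statistics import mean


def system_scores(judgments):
    """Aggregate per-response judge scores into one score per system.

    Assumption: mean aggregation, chosen here only for illustration.
    """
    return {system: mean(scores) for system, scores in judgments.items()}


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems.

    Tied pairs count as neither concordant nor discordant.
    """
    systems = sorted(scores_a)
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: judge scores for three responses per system.
judge_scores = {
    "system_a": [9, 8, 7],
    "system_b": [6, 7, 5],
    "system_c": [4, 9, 8],
}
# Hypothetical human-based system scores (the reference ranking).
human_scores = {"system_a": 8.5, "system_b": 6.0, "system_c": 4.0}

judge_system_scores = system_scores(judge_scores)
tau = kendall_tau(judge_system_scores, human_scores)
print(round(tau, 3))  # 0.333
```

In this toy case the judge ranks system_c above system_b while the human ranking has them the other way round, so one of the three system pairs is discordant and tau is 1/3. A judge whose aggregated scores reproduce the human ordering exactly would score 1.0.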
