Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

  • 2025-06-04 10:46:43
  • Isik Baran Sandan, Tu Anh Dinh, Jan Niehues

Abstract

Large Language Models (LLMs) have been shown to be effective evaluators across various domains, such as machine translation and the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that knockout assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.
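
To make the idea of a knockout tournament with iterative pairwise comparisons concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how candidates are paired or how tournament outcomes are turned into scores, so the adjacent pairing, the bye rule, the "rounds won" tally, and the names knockout_rounds_won and toy_judge are all hypothetical. The judge here is a deterministic stand-in for an LLM call.

```python
from typing import Callable, Dict, List


def knockout_rounds_won(
    candidates: List[str], judge: Callable[[str, str], str]
) -> Dict[str, int]:
    """Run a single-elimination tournament and count rounds won per candidate.

    Assumption: candidates are distinct strings, and `judge(a, b)` returns
    whichever of its two arguments it prefers.
    """
    wins = {c: 0 for c in candidates}
    remaining = list(candidates)
    while len(remaining) > 1:
        next_round = []
        # Pair adjacent candidates; each pair is one pairwise comparison.
        for i in range(0, len(remaining) - 1, 2):
            winner = judge(remaining[i], remaining[i + 1])
            wins[winner] += 1
            next_round.append(winner)
        # With an odd number of candidates, the last one advances on a bye.
        if len(remaining) % 2 == 1:
            next_round.append(remaining[-1])
        remaining = next_round
    return wins


def toy_judge(a: str, b: str) -> str:
    # Hypothetical stand-in for a judge LLM: prefers the longer answer.
    return a if len(a) >= len(b) else b


answers = ["ok", "a fuller answer", "short", "the most detailed answer of all"]
wins = knockout_rounds_won(answers, toy_judge)
```

In this toy run, the candidate that survives every round accumulates the most wins, which gives a coarse ordering over all candidates rather than a single pairwise verdict. In practice, toy_judge would be replaced by a call to a judge LLM that compares two responses.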

 
