Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

Abstract

Large Language Models (LLMs) have been shown to be effective evaluators across various domains such as machine translation or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that Knockout Assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.
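
The abstract does not spell out the tournament procedure, so the following is only a minimal sketch of what a knockout tournament with iterative pairwise comparisons could look like. It assumes a single-elimination bracket, a hypothetical `judge` callable standing in for the judge LLM (it returns `True` when the first candidate is preferred), and a score equal to the number of rounds a candidate survives. None of these names or choices come from the paper.

```python
from typing import Callable, Dict, Sequence


def knockout_rounds(
    candidates: Sequence[str],
    judge: Callable[[str, str], bool],
) -> Dict[int, int]:
    """Run a single-elimination tournament over the candidates.

    `judge(a, b)` is a hypothetical stand-in for a pairwise LLM comparison
    and must return True if `a` is preferred over `b`.
    Returns a mapping from candidate index to rounds survived.
    """
    survived = {i: 0 for i in range(len(candidates))}
    alive = list(range(len(candidates)))

    while len(alive) > 1:
        next_round = []
        # Pair neighbouring candidates and keep the preferred one.
        for k in range(0, len(alive) - 1, 2):
            a, b = alive[k], alive[k + 1]
            winner = a if judge(candidates[a], candidates[b]) else b
            next_round.append(winner)
        # Assumption: with an odd field, the last candidate gets a bye.
        if len(alive) % 2 == 1:
            next_round.append(alive[-1])
        for i in next_round:
            survived[i] += 1
        alive = next_round

    return survived


if __name__ == "__main__":
    # Toy judge for illustration only: prefers the longer answer.
    answers = ["aa", "a", "aaaa", "aaa"]
    print(knockout_rounds(answers, lambda x, y: len(x) >= len(y)))
```

In this sketch the rounds-survived count serves as a coarse score that could then be correlated with expert scores. A bye counts as a survived round here, which is a simplification rather than a claim about the method.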
