Abstract
Large Language Models (LLMs) have been shown to be effective evaluators across various domains, such as machine translation and scientific text. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, which prevents the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method that uses a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that Knockout Assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluation, and aligning LLM assessments more closely with human scoring.
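To make the tournament idea concrete, the following is a minimal sketch of a single-elimination bracket built from pairwise comparisons. It is an illustration under stated assumptions, not the procedure used in the experiments: the function `knockout`, the `judge` callable (which would wrap a pairwise LLM comparison in practice), the random seeding, the bye rule for odd pools, and the use of "rounds won" as the outcome are all hypothetical choices made here.

```python
import random
from typing import Callable, Dict, List, Optional


def knockout(
    candidates: List[str],
    judge: Callable[[str, str], str],
    seed: Optional[int] = None,
) -> Dict[str, int]:
    """Run a single-elimination tournament and return rounds won per candidate.

    Assumptions (not taken from the paper): candidates are unique strings,
    the bracket is seeded by a random shuffle, and an odd candidate out
    receives a bye into the next round.
    """
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    wins = {c: 0 for c in pool}

    while len(pool) > 1:
        next_round = []
        # With an odd number of candidates, the last one advances unopposed.
        if len(pool) % 2 == 1:
            next_round.append(pool.pop())
        # Pair up the remaining candidates and keep each pairwise winner.
        for a, b in zip(pool[0::2], pool[1::2]):
            winner = judge(a, b)
            wins[winner] += 1
            next_round.append(winner)
        pool = next_round

    return wins


def longer_wins(a: str, b: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer answer."""
    return a if len(a) >= len(b) else b


result = knockout(["aa", "a", "aaaa", "aaa"], longer_wins, seed=0)
```

A tournament over n candidates needs n - 1 pairwise comparisons, and the strongest candidate under the judge is compared against progressively stronger opponents. How tournament outcomes are mapped to final scores is not described in the abstract, so the win counts returned here are only a placeholder for that step.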