Abstract
Large Language Models (LLMs) have significantly advanced the state of the art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, despite the growing adoption of the LLM-as-a-Judge paradigm, its effectiveness in coding scenarios remains underexplored due to the absence of dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge models, we find that recent thinking models significantly outperform non-thinking models on our carefully designed code judging tasks. Notably, even relatively small thinking models, such as Qwen3-8B, can outperform specially trained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still exhibit significant randomness in their judgment of coding tasks. For pairwise judging tasks, simply changing the order in which responses are presented can substantially impact accuracy. In addition, LLM-as-a-Judge models show variance in performance when judging code and unit tests written by different LLMs. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal prompting strategies for LLM-as-a-Judge. We find that pairwise comparison outperforms scalar pointwise judging. Furthermore, retaining comments and reasoning in the full, unprocessed LLM response leads to improved judge performance.
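
The order sensitivity described above can be illustrated with a minimal sketch. It is not the benchmark's actual evaluation code: the `Judge` signature, the `order_sensitivity` helper, the toy `first_position_judge`, and the example pairs are all hypothetical, introduced only to show how accuracy can be measured separately for each presentation order.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# takes (question, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def order_sensitivity(
    judge: Judge, pairs: Sequence[Tuple[str, str, str]]
) -> Tuple[float, float]:
    """Return judge accuracy for both presentation orders.

    Each item in `pairs` is (question, good_response, bad_response).
    The first value is accuracy when the good response is shown in
    position A, the second when it is shown in position B.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    n = len(pairs)
    # Good response first: the correct verdict is "A".
    good_first = sum(judge(q, good, bad) == "A" for q, good, bad in pairs)
    # Good response second: the correct verdict is "B".
    good_second = sum(judge(q, bad, good) == "B" for q, good, bad in pairs)
    return good_first / n, good_second / n


def first_position_judge(question: str, response_a: str, response_b: str) -> str:
    """Toy judge with extreme position bias: always prefers position A."""
    return "A"


# Illustrative data only; not drawn from CodeJudgeBench.
pairs = [
    ("Reverse a string", "return s[::-1]", "return s"),
    ("Add two integers", "return a + b", "return a - b"),
]

acc_first, acc_second = order_sensitivity(first_position_judge, pairs)
print(acc_first, acc_second)
```

A judge with no position bias would score the same in both orders; the gap between the two accuracies is one simple way to quantify the sensitivity.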