JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Abstract

Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales, from 7B and 13B to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning an LLM as a judge and categorize them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM achieves state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is also efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities as a judge of single answers, multimodal models, multiple answers, and multi-turn chat.
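The abstract names swap augmentation as a remedy for position bias and reports agreement with the teacher judge, but does not spell out either procedure. The Python sketch below is one plausible reading, not the authors' implementation: it assumes a pairwise sample with two answers and two scores, that swapping means exchanging the answers together with their scores, and that agreement is the fraction of matching verdicts. The names `JudgeSample`, `swap_augment`, `verdict`, and `agreement` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class JudgeSample:
    """Hypothetical pairwise judging sample (field names are assumptions)."""
    question: str
    answer_a: str
    answer_b: str
    score_a: float
    score_b: float


def swap_augment(sample: JudgeSample) -> JudgeSample:
    """Return a copy with the two answers, and their scores, exchanged.

    Training on both orderings is one way to discourage a judge from
    favouring whichever answer appears first (position bias).
    """
    return replace(
        sample,
        answer_a=sample.answer_b,
        answer_b=sample.answer_a,
        score_a=sample.score_b,
        score_b=sample.score_a,
    )


def verdict(sample: JudgeSample) -> str:
    """Map a pair of scores to a verdict: 'A', 'B', or 'tie'."""
    if sample.score_a > sample.score_b:
        return "A"
    if sample.score_a < sample.score_b:
        return "B"
    return "tie"


def agreement(judge_verdicts: list[str], teacher_verdicts: list[str]) -> float:
    """Fraction of samples on which the judge matches the teacher."""
    if len(judge_verdicts) != len(teacher_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not judge_verdicts:
        raise ValueError("verdict lists must not be empty")
    matches = sum(j == t for j, t in zip(judge_verdicts, teacher_verdicts))
    return matches / len(judge_verdicts)
```

Under these assumptions, a swapped sample carries the opposite verdict ('A' becomes 'B'), so a judge trained on both orderings sees the same preference expressed in both positions.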
