CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

Abstract

Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: (1) performing unitary scoring and two-model comparisons as a reward model; (2) conducting evaluations according to specified formats; (3) generating critiques; (4) executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.
