CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

  • 2024-10-21 18:56:51
  • Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen

Abstract

Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.

 
