Abstract
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pairwise ranking formats grouped with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.