Abstract
Large Language Models (LLMs) have recently been shown to be effective asautomatic evaluators with simple prompting and in-context learning. In thiswork, we assemble 15 LLMs of four different size ranges and evaluate theiroutput responses by preference ranking from the other LLMs as evaluators, suchas System Star is better than System Square. We then evaluate the quality ofranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators(CoBBLEr), a benchmark to measure six different cognitive biases in LLMevaluation outputs, such as the Egocentric bias where a model prefers to rankits own outputs highly in evaluation. We find that LLMs are biased text qualityevaluators, exhibiting strong indications on our bias benchmark (average of 40%of comparisons across all models) within each of their evaluations thatquestion their robustness as evaluators. Furthermore, we examine thecorrelation between human and machine preferences and calculate the averageRank-Biased Overlap (RBO) score to be 49.6%, indicating that machinepreferences are misaligned with humans. According to our findings, LLMs maystill be unable to be utilized for automatic annotation aligned with humanpreferences. Our project page is at: https://minnesotanlp.github.io/cobbler.