Abstract
Evaluating large language models (LLMs) is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgments. An alternative is to have humans evaluate the LLMs. This poses scalability issues: there is a large and growing number of models to evaluate, making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. Another approach is the use of public arenas, such as the popular LM arena, in which any user can freely evaluate models on any question and rank the responses of two models. The results are then compiled into a model ranking. An increasingly important aspect of LLMs is their energy consumption; it is therefore of interest to evaluate how energy awareness influences human decisions when selecting a model. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.