Abstract
Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging due to trade-offs among widely used metrics such as coherence, diversity, and perplexity. This paper addresses the specific problem of multicriteria evaluation for open-ended text generation, proposing novel methods for both relative and absolute rankings of decoding methods. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Our experiments demonstrate that the proposed approaches offer a robust way to compare decoding strategies and serve as valuable tools to guide model selection for open-ended text generation tasks. We suggest future directions for improving evaluation methodologies in text generation and make our code, datasets, and models publicly available.