State of What Art? A Call for Multi-Prompt LLM Evaluation

Abstract

Recent advances in large language models (LLMs) have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template for evaluating all LLMs on a specific task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead. We discuss tailored evaluation metrics for specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), ensuring a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and conduct evaluations of multiple models, providing insights into the true strengths and limitations of current LLMs.
