Abstract
Recent advances in large language models (LLMs) have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template for evaluating all LLMs on a specific task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead. We discuss tailored evaluation metrics for specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), ensuring a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and conduct evaluations of multiple models, providing insights into the true strengths and limitations of current LLMs.