Prompt-to-Leaderboard

Abstract

Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. The core idea is to train an LLM taking natural language prompts as input to output a vector of Bradley-Terry coefficients which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot in the Chatbot Arena leaderboard. Our code is available at this GitHub link: https://github.com/lmarena/p2l.
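
To make the core idea concrete, the sketch below shows how a vector of Bradley-Terry coefficients for a single prompt could be turned into a pairwise preference prediction, a prompt-specific leaderboard, and a simple routing decision. It is an illustration under stated assumptions, not the authors' implementation: the model names and coefficient values are hypothetical, the coefficients are hard-coded rather than produced by a trained LLM, and routing is reduced to picking the highest coefficient, which ignores any cost or other constraints a real router might apply.

```python
import math

def win_probability(coefs, a, b):
    """Bradley-Terry probability that model `a` is preferred over model `b`.

    Under the standard Bradley-Terry model this is the logistic function
    of the difference between the two models' coefficients.
    """
    return 1.0 / (1.0 + math.exp(-(coefs[a] - coefs[b])))

def leaderboard(coefs):
    """Rank models for this prompt, highest coefficient first."""
    return sorted(coefs, key=coefs.get, reverse=True)

def route(coefs):
    """Simplified routing (an assumption): pick the top-ranked model."""
    return max(coefs, key=coefs.get)

# Hypothetical coefficients for one prompt. In P2L these would be the
# output of the trained model; here they are made-up illustrative values.
example_coefs = {"model_a": 1.2, "model_b": 0.4, "model_c": -0.3}

if __name__ == "__main__":
    p = win_probability(example_coefs, "model_a", "model_b")
    print(f"P(model_a preferred over model_b) = {p:.3f}")
    print("Prompt-specific leaderboard:", leaderboard(example_coefs))
    print("Routed to:", route(example_coefs))
```

Because only coefficient differences enter the win probability, adding the same constant to every coefficient leaves all predictions and the ranking unchanged.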
