Mapping Overlaps in Benchmarks through Perplexity in the Wild

Abstract

We develop signatures of capacity familiarity to characterize large language model (LLM) benchmarks and their meaningful overlaps. Benchmark signatures probe the capacity required for benchmark performance. We formally define them as a set of salient tokens drawn from in-the-wild, naturally authored corpora, where LLM token perplexity, reflecting more or less pre-training exposure, becomes highly predictive of LLM benchmark performance. Through a large-scale meta-evaluation, we extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks spanning diverse knowledge, coding, logic, instruction following, math, language, reasoning, and world modeling. Our analysis situates signatures in relation to both the semantic similarity of benchmark questions and the correlation of model performance. While performance overlaps are universally high and semantic overlaps remain confined to a narrow mid-range, benchmark signatures prove highly informative in capturing variation, overlap, and divergence. We observe overlap in knowledge and reasoning subtasks, whereas multilingual and cultural benchmarks exhibit less similarity, even compared to cross-task overlap. Notably, performance-level results are strongly influenced by benchmark-orthogonal factors such as question format, highlighting limitations in LLM generalization, the conflation of performance with ability, and issues inherent in current mainstream benchmark agreement studies. Benchmark signatures, however, remain robust to such effects. Ultimately, we identify cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding emerging as the least overlapping domain. Together, these findings provide mechanistic insights into benchmark validity and LLM sensitivities, and sketch the underlying landscape of interconnected LLM capabilities.
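
The extraction step named in the abstract, stepwise forward selection with linear regressions, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the stopping rule, any pre-filtering of tokens, or regularization, so this version simply adds, at each step, the candidate token whose per-model perplexity most reduces the residual sum of squares of an ordinary least-squares fit. The function names (`solve_linear`, `ols_rss`, `forward_select`) and the toy numbers are hypothetical; rows stand for LLMs, columns for candidate tokens, and `y` for scores on one benchmark.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def ols_rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on columns `cols`."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    p = len(cols) + 1
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    try:
        beta = solve_linear(XtX, Xty)  # normal equations
    except ValueError:
        return float("inf")  # collinear candidate: never preferred
    preds = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return sum((t - p_) ** 2 for t, p_ in zip(y, preds))


def forward_select(X, y, k):
    """Greedily pick k columns, each time adding the one that minimizes RSS."""
    selected = []
    remaining = list(range(len(X[0])))
    for _ in range(k):
        best = min(remaining, key=lambda c: ols_rss(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up for illustration): 6 models x 3 candidate tokens.
# Each entry plays the role of one model's perplexity on one token.
X = [
    [1.0, 3.0, 2.0],
    [2.0, 1.0, 7.0],
    [3.0, 4.0, 1.0],
    [4.0, 1.0, 8.0],
    [5.0, 5.0, 2.0],
    [6.0, 9.0, 8.0],
]
# Benchmark scores constructed as 10 - 3 * token2 + 0.5 * token0.
y = [10 - 3 * row[2] + 0.5 * row[0] for row in X]

signature = forward_select(X, y, 2)
print(signature)
```

In this toy setup the scores are built from columns 2 and 0 only, so those are the columns the greedy search should recover as the "signature". A realistic run would work with far more candidate tokens than models, which is one reason a stepwise procedure that adds a single feature at a time is a natural fit for the setting described above.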
