Abstract
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.