Abstract
As AI systems continue to grow, particularly generative models like Large Language Models (LLMs), their rigorous evaluation is crucial for development and deployment. To determine their adequacy, researchers have developed various large-scale benchmarks, evaluating models against a so-called gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high computational costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this Perspective, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real time, tailoring the evaluation based on the model's ongoing performance instead of relying on a fixed test set. This paradigm not only provides a more robust ability estimation but also significantly reduces the number of test items required. We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation. We propose that adaptive testing will become the new norm in AI model evaluation, enhancing both the efficiency and effectiveness of assessing advanced intelligence systems.