Abstract
Active testing enables label-efficient evaluation of models through careful data acquisition. However, its significant computational costs have previously undermined its use for large models. We show how it can be successfully scaled up to the evaluation of large language models (LLMs). In particular, we show that the surrogate model used to guide data acquisition can be constructed cheaply using in-context learning, does not require updating within an active-testing loop, and can be smaller than the target model. We even find we can make good data-acquisition decisions without computing predictions with the target model, and we further introduce a single-run error estimator to assess how well active testing is working on the fly. We find that our approach is able to more effectively evaluate LLM performance with less data than current standard practices.