The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance

Abstract

Cloze testing is a common method for measuring the behavior of large language models on a number of benchmark tasks. Using the MMLU dataset, we show that the base-rate probability (BRP) differences across answer tokens are significant and affect task performance, i.e., guess A if uncertain. We find that counterfactual prompting does sufficiently mitigate the BRP effect. The BRP effect is found to act much like the test-taking strategies employed by humans, leading to the conflation of task performance and test-taking ability. We propose the Nvr-X-MMLU task, a variation of MMLU, which helps to disambiguate test-taking ability from task performance and reports the latter.
