Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation

Abstract

Data contamination in model evaluation is becoming increasingly prevalent, as the massive training corpora of large language models often unintentionally include benchmark samples. Contamination analysis has therefore become an inevitable part of reliable model evaluation. However, existing methods of contamination analysis require access to the entire training data, which is often confidential for recent models. This prevents the community from rigorously auditing these models and accurately assessing their capabilities. In this paper, we propose a novel method that quantifies contamination without access to the full training set by measuring the extent of contamination with perplexity. Our analysis provides evidence of significant memorisation by recent foundation models on popular reading comprehension and summarisation benchmarks, while multiple-choice benchmarks appear less contaminated.
