Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

Abstract

Recent claims about the impressive capabilities of large language models (LLMs) are usually supported by evaluation on open-access benchmarks. Given the vast size and wide-ranging sources of LLMs' training data, it could explicitly or implicitly include test data, leaving LLMs susceptible to data contamination. However, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for LLMs faces significant challenges. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD requires only sampled texts to detect data contamination, by identifying the peakedness of an LLM's output distribution. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution, based on the correction of an LLM's output distribution. To facilitate this study, we introduce two benchmarks, i.e., DetCon and ComiEval, for data contamination detection and contamination mitigation evaluation tasks. Extensive experimental results show that CDD achieves average relative improvements of 21.8%-30.2% over other contamination detection approaches in terms of Accuracy, F1 Score, and AUC metrics, and can effectively detect implicit contamination. TED substantially mitigates performance improvements of up to 66.9% attributed to data contamination across various contamination setups. In real-world applications, we reveal that ChatGPT exhibits a high potential to suffer from data contamination on the HumanEval benchmark.
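
To make the idea of "peakedness" concrete, the following is a minimal sketch, not the paper's implementation. It assumes that peakedness can be approximated as the fraction of sampled outputs lying within a small edit distance of a reference output (for example, a greedy decode). The function names, the character-level edit distance, and the values of `alpha` and `threshold` are illustrative assumptions; the abstract does not specify them.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (assumed similarity measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def peakedness(reference: str, samples: List[str], alpha: float = 0.05) -> float:
    """Fraction of samples within alpha * max_len edits of the reference.

    A value near 1.0 means the sampled outputs collapse onto one text,
    i.e. the output distribution is sharply peaked.
    `alpha` is a hypothetical tolerance, not a value taken from the abstract.
    """
    if not samples:
        raise ValueError("need at least one sampled output")
    max_len = max(len(s) for s in samples + [reference])
    radius = alpha * max_len
    close = sum(1 for s in samples if edit_distance(reference, s) <= radius)
    return close / len(samples)


def looks_contaminated(reference: str, samples: List[str],
                       alpha: float = 0.05, threshold: float = 0.5) -> bool:
    """Flag a prompt when peakedness exceeds a (hypothetical) threshold."""
    return peakedness(reference, samples, alpha) > threshold


# Usage: the strings stand in for model outputs sampled for one benchmark prompt.
greedy = "def add(a, b): return a + b"
sampled = [greedy] * 4 + ["print('hello world')"]
score = peakedness(greedy, sampled)
flagged = looks_contaminated(greedy, sampled)
```

Only sampled texts are needed here, which matches the black-box setting the abstract describes; in practice the tolerance and threshold would have to be tuned on data with known contamination labels.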
