DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Abstract

Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
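
To make the idea of an exactly computable FPR concrete, here is a minimal sketch of one way such a rate could be calculated. It rests on assumptions that the abstract does not spell out: each backdoor's target is drawn uniformly at random from a fixed number of output partitions, an uncontaminated model matches each target independently by chance, and a model is flagged once it matches at least a chosen number of backdoors. The function name `false_positive_rate`, its parameters, and the specific partition counts and thresholds are hypothetical and chosen only for illustration.

```python
from math import comb


def false_positive_rate(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """Binomial tail: chance that a clean model matches >= threshold backdoor targets.

    Assumption (not stated in the abstract): each target is uniform over
    `num_targets` partitions, so a clean model matches any one backdoor
    with probability 1 / num_targets, independently across backdoors.
    """
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, k) * p**k * (1.0 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )


# Hypothetical settings: 10 partitions, flag at >= 7 of 8 or >= 4 of 6 matches.
print(f"{false_positive_rate(8, 10, 7):.6%}")  # 0.000073%
print(f"{false_positive_rate(6, 10, 4):.3%}")  # 0.127%
```

Under these illustrative settings the binomial tail coincides with the 0.000073% and 0.127% figures quoted above, which suggests, but does not confirm, that the reported rates follow a computation of this kind.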
