Abstract
Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging MCQ benchmark called MMLU-CF. This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage. To avoid unintentional data leakage, we source data from a broader domain and design three decontamination rules. To prevent malicious data leakage, we divide the benchmark into validation and test sets with similar difficulty and subject distributions. The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent verification. Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves a 5-shot score of only 73.4% and a 0-shot score of only 71.9% on the test set, which indicates the effectiveness of our approach in creating a more rigorous and contamination-free evaluation standard. The GitHub repository is available at https://github.com/microsoft/MMLU-CF and the dataset is available at https://huggingface.co/datasets/microsoft/MMLU-CF.