Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

Abstract

We identify label errors in the test sets of 10 of the most commonly used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of 3.4% errors across the 10 datasets, where for example 2916 label errors comprise 6% of the ImageNet validation set. Putative label errors are identified using confident learning algorithms and then human-validated via crowdsourcing (54% of the algorithmically flagged candidates are indeed erroneously labeled). Traditionally, machine learning practitioners choose which model to deploy based on test accuracy. Our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. Surprisingly, we find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels, ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels, VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by just 5%.
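
To make the flagging step concrete, here is a minimal sketch of the thresholding idea behind confident learning. It is a simplified illustration, not the authors' implementation: the function name `flag_label_issues`, the toy labels, and the toy probabilities are all hypothetical, and the sketch assumes out-of-sample predicted probabilities are already available. Each class gets a threshold equal to the mean predicted probability of that class over the examples carrying that label; an example is flagged when the most probable class that clears its threshold differs from the given label.

```python
from typing import List, Sequence


def flag_label_issues(
    labels: Sequence[int], pred_probs: Sequence[Sequence[float]]
) -> List[int]:
    """Return indices of examples whose given label looks suspicious.

    Simplified, illustrative version of confident-learning thresholding.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class j
    # over the examples whose given label is j.
    thresholds = []
    for j in range(n_classes):
        probs_j = [p[j] for p, y in zip(pred_probs, labels) if y == j]
        # A class with no labeled examples gets an unreachable threshold.
        thresholds.append(sum(probs_j) / len(probs_j) if probs_j else float("inf"))

    flagged = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes whose predicted probability clears the class threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # the model is not confident about any class
        likely = max(confident, key=lambda j: p[j])
        if likely != y:
            flagged.append(i)
    return flagged


# Toy data (hypothetical): example 2 is labeled 0 but the model strongly predicts 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
flagged = flag_label_issues(labels, pred_probs)
print(flagged)  # [2]
```

Flagged indices are only candidates. As the abstract notes, roughly half of the algorithmically flagged examples turned out to be genuine label errors after human review, so a validation step would still be needed.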
