Machine learning is currently dominated by largely experimental work focused on improvements in a few key tasks. However, the impressive accuracy numbers of the best performing models are questionable because the same test sets have been used to select these models for multiple years now. To understand the danger of overfitting, we measure the accuracy of CIFAR-10 classifiers by creating a new test set of truly unseen images. Although we ensure that the new test set is as close to the original data distribution as possible, we find a large drop in accuracy (4% to 10%) for a broad range of deep learning models. Yet more recent models with higher original accuracy show a smaller drop and better overall performance, indicating that this drop is likely not due to overfitting based on adaptivity. Instead, we view our results as evidence that current accuracy numbers are brittle and susceptible to even minute natural variations in the data distribution.