Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.