Machine learning systems can often achieve high performance on a test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. Based on an analysis of the task, we hypothesize three fallible syntactic heuristics that NLI models are likely to adopt: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including the state-of-the-art model BERT, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
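
To make two of the hypothesized heuristics concrete, the following Python sketch shows what a system relying on them would predict. It is an illustration under stated assumptions, not the evaluation code behind HANS: the function names (`tokenize`, `lexical_overlap_predicts_entailment`, `subsequence_predicts_entailment`) are hypothetical, tokenization is a naive whitespace split, and the sentence pairs are illustrative examples in the style of the dataset. The lexical overlap heuristic is taken here to mean "predict entailment whenever every hypothesis word appears in the premise", and the subsequence heuristic to mean "predict entailment whenever the hypothesis is a contiguous subsequence of the premise". The constituent heuristic is omitted because it would require a parse tree.

```python
def tokenize(sentence):
    """Naive tokenizer (an assumption): lowercase, split on whitespace,
    and strip surrounding punctuation from each token."""
    tokens = [tok.strip(".,!?;:") for tok in sentence.lower().split()]
    return [tok for tok in tokens if tok]


def lexical_overlap_predicts_entailment(premise, hypothesis):
    """True if every hypothesis word also occurs somewhere in the premise."""
    premise_words = set(tokenize(premise))
    return all(tok in premise_words for tok in tokenize(hypothesis))


def subsequence_predicts_entailment(premise, hypothesis):
    """True if the hypothesis tokens form a contiguous span of the premise."""
    p, h = tokenize(premise), tokenize(hypothesis)
    n = len(h)
    return n > 0 and any(p[i:i + n] == h for i in range(len(p) - n + 1))


# In both pairs the premise does NOT entail the hypothesis,
# yet the corresponding heuristic predicts entailment.
overlap_pair = ("The doctor was paid by the actor.", "The doctor paid the actor.")
subseq_pair = ("The doctor near the actor danced.", "The actor danced.")

print(lexical_overlap_predicts_entailment(*overlap_pair))
print(subsequence_predicts_entailment(*subseq_pair))
```

In both pairs the heuristic fires even though the correct label is non-entailment, which is the kind of case the evaluation set is designed to contain. Note that in this sketch any pair satisfying the subsequence check also satisfies the lexical overlap check, since a contiguous span of the premise uses only premise words.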