Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

  • 2019-02-04 01:54:19
  • R. Thomas McCoy, Ellie Pavlick, Tal Linzen

Abstract

Machine learning systems can often achieve high performance on a test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. Based on an analysis of the task, we hypothesize three fallible syntactic heuristics that NLI models are likely to adopt: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including the state-of-the-art model BERT, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.

 


R. Thomas McCoy (1), Ellie Pavlick (2), and Tal Linzen (1)
(1) Department of Cognitive Science, Johns Hopkins University
(2) Department of Computer Science, Brown University

| Heuristic | Definition | Premise | Hypothesis wrongly predicted to be entailed |
|---|---|---|---|
| Lexical overlap | Assume that a premise entails all hypotheses constructed from words in the premise. | The doctor was paid by the actor. | The doctor paid the actor. |
| Subsequence | Assume that a premise entails all of its contiguous subsequences. | The doctor near the actor danced. | The actor danced. |
| Constituent | Assume that a premise entails all complete subtrees in its parse tree. | If the artist slept, the actor ran. | The artist slept. |

Table 1: The heuristics targeted by the HANS dataset, along with examples of incorrect entailment predictions that these heuristics would lead to.

1 Introduction

Neural networks have an exceptional ability to learn the statistical patterns in a training set and apply them to similar test cases. Therefore, these systems are susceptible to adopting shallow heuristics that are effective for the majority of training examples, instead of learning the nuanced reasoning processes that they are intended to capture: if heuristics usually yield the correct output, the loss function provides little incentive for a model without a strong inductive bias to learn deeper generalizations.

Examples of this issue have been documented across domains in artificial intelligence. Deep neural networks trained to recognize objects are misled by contextual heuristics: they recognize monkeys within their typical environment with high accuracy, yet they label a monkey holding a guitar as a human, presumably because guitars never co-occur with monkeys in the training set (Wang et al., 2018). Neural visual question answering systems can answer the question “what is covering the ground?” with “snow” with excellent accuracy, but only because that question appears in the training set exclusively when it is in fact snow that is covering the ground, rather than because they understand the predicate “cover the ground”; such a system would be stumped by an image in which the ground is covered with leaves (Agrawal et al., 2016).

We focus on natural language inference (NLI), the task of determining whether a premise sentence entails (i.e. implies the truth of) a hypothesis sentence (Condoravdi et al., 2003; Dagan et al., 2006; Bowman et al., 2015). On this task as well, neural models have been shown to learn shallow heuristics that are based on the distributions of specific words (Naik et al., 2018; Sanchez et al., 2018). For example, models might learn to assign a label of contradiction to any example containing a negation word such as “not”, since such words are often associated with examples of contradiction.

In this work, we address whether NLI models might also learn higher-order statistical patterns based on superficial syntactic properties of the sentences. Consider the following sentence pair, which has the target label entailment:

(1) Premise: The judge was paid by the actor.
    Hypothesis: The actor paid the judge.

An NLI system that labels this example correctly might do so not by reasoning about the meanings of these sentences, but rather by assuming that the premise entails any hypothesis whose words all appear in the premise (Dasgupta et al., 2018; Naik et al., 2018). Crucially, if the model is using this heuristic, it will predict entailment for (2) as well, even though that label is incorrect in this case:

(2) Premise: The actor was paid by the judge.
    Hypothesis: The actor paid the judge.

We introduce a new evaluation set called HANS (Heuristic Analysis for NLI Systems; available at https://github.com/tommccoy1/hans), designed to diagnose the use of such fallible structural heuristics. We design our dataset such that models that employ these heuristics are guaranteed to fail on particular classes of examples in our dataset, rather than simply show weaker performance. We target three heuristics, defined in Table 1. While these heuristics often yield correct labels, they clearly do not correspond to our intuitions of what natural language understanding requires.

We evaluate four popular NLI models, including the state-of-the-art BERT model (Devlin et al., 2018), on the HANS dataset. On the examples where the heuristics fail, all models performed substantially below chance, barely exceeding 0% accuracy in most cases. We conclude that their behavior is consistent with the hypothesis that they have adopted these heuristics.

Contributions:

This paper has two main contributions. First, we introduce the HANS dataset, an NLI evaluation set that tests specific hypotheses about invalid heuristics that NLI models are likely to learn. Second, we use this dataset to illuminate interpretable shortcomings in state-of-the-art models trained on MNLI (Williams et al., 2018b); these shortcomings may arise from insufficient inductive biases of the models, from insufficient signal provided by training datasets, or both. These results indicate that there is substantial room for improvement for these models and datasets, and that HANS can serve as a tool for motivating and measuring improvement in this area.

| Heuristic | Premise | Hypothesis | Label |
|---|---|---|---|
| Lexical overlap heuristic | The banker near the judge saw the actor. | The banker saw the actor. | E |
| | The lawyer was advised by the actor. | The actor advised the lawyer. | E |
| | The judge and the artist saw the doctor. | The judge saw the doctor. | E |
| | The doctors visited the lawyer. | The lawyer visited the doctors. | N |
| | The judge by the actor stopped the banker. | The banker stopped the actor. | N |
| | The actor and the judge saw the artist. | The judge saw the actor. | N |
| Subsequence heuristic | The artist and the student called the judge. | The student called the judge. | E |
| | Angry tourists helped the lawyer. | Tourists helped the lawyer. | E |
| | The students ate the fruit. | The students ate. | E |
| | The judges heard the actors resigned. | The judges heard the actors. | N |
| | The senator near the lawyer danced. | The lawyer danced. | N |
| | After the president hid the banker danced. | The president hid the banker. | N |
| Constituent heuristic | Before the actor slept, the senator ran. | The actor slept. | E |
| | The lawyer knew that the judges shouted. | The judges shouted. | E |
| | The senators slept, and the judge waited. | The senators slept. | E |
| | If the actor slept, the judge saw the artist. | The actor slept. | N |
| | The lawyers resigned, or the artist slept. | The artist slept. | N |
| | Supposedly the officers saw the lawyer. | The officers saw the lawyer. | N |

Table 2: Examples of sentences used to test the three heuristics. The label column shows the correct label for the sentence pair; E stands for entailment and N stands for non-entailment. A model relying on the heuristics would label all examples as entailment (incorrectly for those marked as N).

2 Syntactic Heuristics

We focus on three heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic, all defined in Table 1. For the reasons we explain below, these heuristics are likely to be adopted by a statistical learner trained to perform NLI using standard training datasets such as SNLI (Bowman et al., 2015) or MNLI (Williams et al., 2018b).

These heuristics form a hierarchy: The constituent heuristic is a special case of the subsequence heuristic, which is a special case of the lexical overlap heuristic. Table 2 gives examples where each heuristic succeeds and fails.
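To make the hierarchy concrete, the three heuristics can be sketched as simple predicates over a premise and a hypothesis. This is an illustrative sketch only, not code from the HANS repository: the function names, the whitespace tokenizer, and the nested-tuple parse trees are all assumptions made for this example.

```python
import string

def tokenize(sentence):
    """Lowercase, strip punctuation, and split on whitespace (a simplifying assumption)."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()

def lexical_overlap(premise, hypothesis):
    """True if every word of the hypothesis also occurs in the premise."""
    return set(tokenize(hypothesis)) <= set(tokenize(premise))

def is_subsequence(premise, hypothesis):
    """True if the hypothesis is a contiguous word sequence of the premise."""
    p, h = tokenize(premise), tokenize(hypothesis)
    return any(p[i:i + len(h)] == h for i in range(len(p) - len(h) + 1))

def leaves(tree):
    """Words of a parse tree written as nested tuples of strings (hypothetical format)."""
    if isinstance(tree, str):
        return [tree.lower()]
    return [word for child in tree for word in leaves(child)]

def is_constituent(tree, hypothesis):
    """True if some complete subtree of the premise's parse spells out the hypothesis."""
    if leaves(tree) == tokenize(hypothesis):
        return True
    if isinstance(tree, str):
        return False
    return any(is_constituent(child, hypothesis) for child in tree)

# Hand-written bracketings of two Table 1 premises (assumed, not parser output).
conditional = (("If", ("the", "artist", "slept")), ("the", "actor", "ran"))
pp_subject = (("The", "doctor", ("near", ("the", "actor"))), "danced")

print(is_constituent(conditional, "The artist slept."))
print(is_subsequence("The doctor near the actor danced.", "The actor danced."))
print(is_constituent(pp_subject, "The actor danced."))
print(lexical_overlap("The doctor was paid by the actor.", "The doctor paid the actor."))
```

Under these definitions, any hypothesis that is a constituent of the premise is also a contiguous subsequence of it, and any contiguous subsequence also satisfies lexical overlap, which is the nesting described above.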

There are two main reasons why we expect that NLI models might adopt these heuristics. First, the MNLI training set contains far more examples that support the heuristics than examples that contradict them, as the following table demonstrates. (In this table, the counts for the lexical overlap heuristic include examples of the other two heuristics, and the counts for the subsequence heuristic include the counts for the constituent heuristic.)

| Heuristic | Supporting cases | Contradicting cases |
|---|---|---|
| Lexical overlap | 2,158 | 261 |
| Subsequence | 1,274 | 72 |
| Constituent | 1,004 | 58 |

For all three heuristics, there are far more supporting cases than contradicting cases. Therefore, it is likely that a statistical learner would adopt these heuristics. Since MNLI contains data from multiple genres, we conjecture that the scarcity of contradicting examples is not an unusual property of a particular genre, but rather a general property of NLI training sets generated in the approach followed by MNLI.
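A count of this kind could be produced along the following lines. This is a hedged sketch rather than the procedure actually used for the table: it assumes that a "supporting" case is one where the heuristic applies and the gold label is entailment, that a "contradicting" case is one where the heuristic applies and the gold label is anything else, and it covers only the lexical overlap heuristic. The toy examples and the `count_cases` helper are hypothetical.

```python
import string
from collections import Counter

def tokenize(sentence):
    """Lowercase, strip punctuation, and split on whitespace (a simplifying assumption)."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()

def count_cases(examples):
    """Count pairs where lexical overlap holds, split by whether the label agrees."""
    counts = Counter()
    for premise, hypothesis, label in examples:
        if set(tokenize(hypothesis)) <= set(tokenize(premise)):
            # The heuristic predicts entailment; compare with the gold label.
            key = "supporting" if label == "entailment" else "contradicting"
            counts[key] += 1
    return counts

# Toy stand-in for a training set (not MNLI data).
toy_examples = [
    ("The judge was paid by the actor.", "The actor paid the judge.", "entailment"),
    ("The actor was paid by the judge.", "The actor paid the judge.", "non-entailment"),
    ("The students ate the fruit.", "The senators slept.", "non-entailment"),
]

counts = count_cases(toy_examples)
print(counts["supporting"], counts["contradicting"])
```

The third toy pair is ignored because its hypothesis contains words absent from the premise, so the heuristic makes no prediction about it.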

Second, the input representations in current neural NLI systems lend themselves to the adoption of these heuristics. Each heuristic is related to a specific method for representing sentences. The lexical overlap heuristic appeals to the words in a sentence regardless of their positions, and so is likely to be adopted by bag-of-words NLI models (e.g., Parikh et al. 2016). The subsequence heuristic appeals to linearly adjacent chunks of words, so one might expect that it would be adopted by linear RNNs, which process sentences in linear order. Finally, the constituent heuristic appeals to components of the parse tree, so one might expect to see it adopted by tree-based NLI models (Bowman et al., 2016).

Since these three heuristics are closely related to each other, it is also likely that an NLI model would adopt all of them, even the ones that do not specifically target that class of model. For example, sequential RNNs do process individual words, so they are likely susceptible to the lexical overlap heuristic; and, as mentioned before, the subsequence and constituent heuristics are special cases of the lexical overlap heuristic, so a bag-of-words model is likely to be susceptible to all three heuristics.

3 Dataset Composition

For each of the three heuristics, we generated five templates of example types that support the heuristic and five templates of example types that refute the heuristic. Below are two templates used for the subsequence heuristic; see Appendix A for a full list of templates.

  • The N1 V the N2 P the N3. → The N1 V the N2. (entailment; supports the heuristic)

    • The judge saw the actor near the scientists.
      → The judge saw the actor.

    • The bankers called the senator by the artist.
      → The bankers called the senator.

  • The N1 P the N2 V. ↛ The N2 V. (non-entailment; refutes the heuristic)

    • The doctor behind the lawyer ran.
      ↛ The lawyer ran.

    • The scientists by the judge slept.
      ↛ The judge slept.

We generated 1,000 examples from each template, for a total of 10,000 examples per heuristic. Since some heuristics are special cases of others, the examples for a given heuristic are ones that can be classified as instances of that heuristic but not of a more narrowly defined heuristic. Specifically, the lexical overlap examples are ones where every word in the hypothesis is in the premise, but where the hypothesis is not a subsequence or constituent of the premise; and the subsequence examples are ones where the hypothesis is a subsequence, but not a constituent, of the premise.
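A minimal sketch of templatic generation for the second template above (a prepositional phrase on the subject) might look as follows. It is illustrative only: the tiny vocabulary, the dictionary field names, and the `generate` helper are assumptions for this example, and the actual HANS vocabulary and generation code are larger and more careful.

```python
import random

# Hypothetical mini-vocabulary; the real HANS vocabulary is larger.
NOUNS = ["doctor", "lawyer", "judge", "actor", "senator"]
PREPOSITIONS = ["near", "behind", "by"]
INTRANSITIVE_VERBS = ["ran", "slept", "danced"]

def generate_pp_on_subject(rng):
    """Fill 'The N1 P the N2 V.' and pair it with the hypothesis 'The N2 V.'."""
    n1, n2 = rng.sample(NOUNS, 2)  # two distinct nouns, so no content word repeats
    prep = rng.choice(PREPOSITIONS)
    verb = rng.choice(INTRANSITIVE_VERBS)
    return {
        "premise": f"The {n1} {prep} the {n2} {verb}.",
        "hypothesis": f"The {n2} {verb}.",
        "label": "non-entailment",
        "heuristic": "subsequence",
    }

def generate(n_examples, seed=0):
    """Generate n_examples pairs from the template with a fixed random seed."""
    rng = random.Random(seed)
    return [generate_pp_on_subject(rng) for _ in range(n_examples)]

examples = generate(5)
for example in examples:
    print(example["premise"], "|", example["hypothesis"])
```

With a vocabulary this small, repeated examples are likely when sampling many pairs; the sketch does not deduplicate. With 1,000 examples from each of ten templates per heuristic, the three heuristics together account for 3 × 10 × 1,000 = 30,000 examples.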

3.1 Dataset Controls

Plausibility:

One advantage of our templatic generation system over using modified examples from existing corpora is that we can ensure that all generated sentences are grammatical and semantically plausible; for example, we do not generate cases such as “The student read the book” ↛ “The book read the student.” To achieve this, we structured our vocabulary around Ettinger et al. (2018), who chose a vocabulary such that any noun can plausibly be the subject of any verb or the object of any transitive verb. Some of our templates required expanding this basic vocabulary; in those cases, we manually curated the additions to ensure plausibility.

Sentence simplicity:

All of the sentences we used had relatively simple structures, generally consisting of a simple transitive sentence (i.e. “the noun verbed the noun”) or a simple intransitive sentence (i.e. “the noun verbed”). The sentences contained at most one noun phrase modifier (either a prepositional phrase or relative clause). Other structures were only introduced if necessary to illustrate a given heuristic. To avoid potential coreference ambiguity, we excluded sentences that had a content word repeated. Finally, all verbs were in the past tense so that agreement was not a factor in our examples.

Selectional criteria:

Some example types depend on verbs that can have different types of complements (subcategorization frames). For example, the following example requires recognition of the fact that believed can take a clause (the lawyer saw the officer) as its complement:

Premise: The doctor believed the lawyer saw the officer.
Hypothesis (not entailed): The doctor believed the lawyer.

It is arguably unfair to expect a model to understand this example if it had only ever encountered believe with a noun phrase object. Therefore, we only chose verbs that appeared at least 50 times in the MNLI training set in all relevant frames.

4 Experimental Setup

We evaluate four existing models on this dataset:

Decomposable Attention (DA):

Introduced by Parikh et al. (2016), this model uses a form of attention to align words in the premise and in the hypothesis and to make predictions based on the aggregation of this alignment. It does not use any information about word order and can thus be thought of as a bag-of-words model.

ESIM:

The Enhanced Sequential Inference Model (ESIM; Chen et al., 2017) uses a modified version of a bidirectional LSTM to encode sentences. We use the variant with a sequential encoder, rather than the tree-based HIM variant.

SPINN:

The Stack-augmented Parser-Interpreter Neural Network (SPINN; Bowman et al., 2016) is a tree-based model that creates sentence representations by composing together the phrases of the sentence in accordance with a syntactic parse. We use the SPINN-PI-NT variant, which takes parse trees as part of its input (rather than learning to parse on its own). During training, we use the parse trees provided with the MNLI dataset; for the HANS evaluation set, our templates also include templates for each sentence’s parse tree. These parse templates are based on the parses generated by the Stanford PCFG Parser 3.5.2 (Klein and Manning, 2003), the same parser used to generate the parses in MNLI. Based on manual inspection, this parser generally provided reasonable parses for the HANS examples.

| Model | Model class | MNLI | Entailment: Lexical | Entailment: Subseq. | Entailment: Const. | Non-entailment: Lexical | Non-entailment: Subseq. | Non-entailment: Const. |
|---|---|---|---|---|---|---|---|---|
| DA | Bag-of-words | 0.72 | 0.99 | 1.00 | 0.97 | 0.01 | 0.02 | 0.03 |
| ESIM | RNN | 0.77 | 0.98 | 1.00 | 0.99 | 0.01 | 0.02 | 0.01 |
| SPINN | TreeRNN | 0.67 | 0.96 | 0.96 | 0.93 | 0.02 | 0.06 | 0.11 |
| BERT | Transformer | 0.84 | 0.95 | 0.99 | 0.98 | 0.16 | 0.04 | 0.16 |

Table 3: Results. The MNLI column reports accuracy on the MNLI test set. The remaining columns report accuracies on 6 sub-components of the HANS evaluation set; each sub-component is defined by its correct label (either entailment or non-entailment) and the heuristic it addresses.

BERT:

The Bidirectional Encoder Representations from Transformers model (BERT; Devlin et al., 2018) is a Transformer model that uses attention, rather than recurrence, to process sentences. It is the current state-of-the-art model for MNLI. We use the bert-base-uncased pretrained model and fine-tune it on MNLI.

Motivation for model choice:

The HANS dataset is focused on three heuristics based on different types of structures (namely individual words, subsequences, and constituents). Therefore, we chose the first three models listed above because each uses a different type of structure for passing the input sentence to the model: DA uses a bag-of-words structure, ESIM uses a sequential structure, and SPINN uses a tree structure. We also included BERT because it is the current state-of-the-art on this task.

Implementation and evaluation:

For DA and ESIM, we used the implementations from AllenNLP (Gardner et al., 2017). For SPINN (https://github.com/stanfordnlp/spinn; we used the NYU fork at https://github.com/nyu-mll/spinn) and BERT (https://github.com/google-research/bert), we used code from the GitHub repositories for the papers about those models.

We trained all models on MNLI. Although MNLI uses three labels (entailment, contradiction, and neutral), the HANS evaluation set only uses two (entailment and non-entailment) because, for some HANS examples, it is unclear whether the correct answer should be contradiction or neutral. For evaluating a model on HANS, we considered the model’s label to be the one out of entailment, contradiction, or neutral that scored the highest; we then translated contradiction or neutral labels to non-entailment. An alternate approach would have been to add together the scores for contradiction and neutral to determine a score for non-entailment; in practice, we found little difference between these two approaches, since the models almost always assigned more than 50% of the label probability to a single label.
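The two ways of collapsing three-way predictions into two labels can be sketched as follows. This is an illustration of the procedure described above, not the evaluation code used in the experiments; the function names and the dictionary-of-probabilities input format are assumptions, and the handling of exact ties is an arbitrary choice made here.

```python
LABELS = ("entailment", "contradiction", "neutral")

def collapse_argmax(probs):
    """Pick the highest-scoring three-way label, then map it to two labels."""
    best = max(LABELS, key=lambda label: probs[label])
    return "entailment" if best == "entailment" else "non-entailment"

def collapse_sum(probs):
    """Alternative: add the contradiction and neutral scores before comparing."""
    non_entailment = probs["contradiction"] + probs["neutral"]
    # Ties go to non-entailment here; this tie-breaking rule is an assumption.
    return "entailment" if probs["entailment"] > non_entailment else "non-entailment"

confident = {"entailment": 0.9, "contradiction": 0.04, "neutral": 0.06}
split = {"entailment": 0.4, "contradiction": 0.3, "neutral": 0.3}

print(collapse_argmax(confident), collapse_sum(confident))
print(collapse_argmax(split), collapse_sum(split))
```

The two approaches can disagree only when no single label receives more than half of the probability mass, as in the `split` example; whenever one label exceeds 50%, both return the same answer.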

5 Results

The results of our experiments are shown in Table 3. All four models achieved high scores on the MNLI test set. We do not know of past work that trained the DA model on MNLI, but for the other three models our results replicated the accuracies found in past work (Williams et al. 2018b for ESIM, Williams et al. 2018a for SPINN, and Devlin et al. 2018 for BERT).

On the HANS dataset, all models almost always assigned the correct label in the cases where the label is entailment, i.e. where the correct answer is in line with the hypothesized heuristics. However, they all performed poorly—their accuracy was less than 10% in most cases, when chance is 50%—on the cases where the heuristics make the incorrect predictions. Thus, despite their strong performance on the MNLI test set, all four models behaved in a way consistent with the use of the heuristics targeted in HANS.
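The per-sub-component accuracies reported in Table 3 group examples by the heuristic they target and by their correct label. A minimal sketch of such a breakdown is given below; the field names `heuristic` and `label` and the helper `accuracy_by_subcomponent` are hypothetical and do not reflect the format of the released HANS files.

```python
from collections import defaultdict

def accuracy_by_subcomponent(examples, predictions):
    """Accuracy for each (heuristic, gold label) pair."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, prediction in zip(examples, predictions):
        key = (example["heuristic"], example["label"])
        total[key] += 1
        correct[key] += int(prediction == example["label"])
    return {key: correct[key] / total[key] for key in total}

# Toy gold data (assumed format) and a model that always predicts entailment,
# which is what a model relying purely on the heuristics would do on HANS.
toy_gold = [
    {"heuristic": "lexical_overlap", "label": "entailment"},
    {"heuristic": "lexical_overlap", "label": "non-entailment"},
    {"heuristic": "subsequence", "label": "entailment"},
    {"heuristic": "subsequence", "label": "non-entailment"},
]
always_entailment = ["entailment"] * len(toy_gold)

scores = accuracy_by_subcomponent(toy_gold, always_entailment)
for key in sorted(scores):
    print(key, scores[key])
```

A purely heuristic model scores perfectly on every entailment sub-component and zero on every non-entailment one, which is close to the pattern the four models show in Table 3.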

Comparison of models:

Both DA and ESIM had near-zero accuracy on the non-entailment cases for all three heuristics. These models might therefore make no distinction between the three heuristics, but instead treat all of them as the same phenomenon, i.e. lexical overlap. Indeed, for DA, this must be the case, as the model’s architecture does not give it access to word order; ESIM does in theory have access to word order information but does not appear to use it here.

SPINN performed best of all the models on the subsequence case. This might be due to the tree-based nature of its input; since the subsequences are not constituents, they do not form cohesive units in SPINN’s input in the same way that they do for sequential models. SPINN also notably outperformed DA and ESIM on the constituent cases; this suggests that the tree-based nature of SPINN did help it learn to understand the contributions of specific constituents to the sentence as a whole. SPINN also did worse than the other models on the constituent cases where the correct answer is entailment. This serves as further evidence that SPINN is not as likely as the other models to assume that constituents of the premise are entailed; this difference appears to moderately harm its performance in cases where that assumption leads to the correct answer.

Finally, though BERT did slightly worse than SPINN on the subsequence cases, it performed noticeably less poorly (though still far below chance) than the other models at both the constituent cases and the lexical overlap cases. Its performance was particularly striking on the lexical overlap case, suggesting that some of BERT’s success at MNLI may be attributable to a greater tendency to incorporate word order information than is utilized by other models.

Analysis of particular example types:

A finer-grained look at the results shows that, in the cases where a model’s performance was perceptibly above zero, its accuracy was not evenly spread across all the subcases within that heuristic (the complete fine-grained results can be found in Appendix B). For example, within the lexical overlap cases, BERT achieved 39% accuracy on conjunction cases (e.g., “The actor and the doctor saw the artist.” ↛ “The actor saw the doctor.”) but 0% accuracy on the subject/object swap cases (“The doctor advised the lawyer.” ↛ “The lawyer advised the doctor.”). Within the constituent heuristic cases, BERT achieved 49% accuracy at determining that a clause embedded under “if” and other conditional words is not entailed (“If the doctor resigned, the lawyer advised the artist.” ↛ “The doctor resigned.”), but 0% accuracy at identifying that the clause outside of the conditional clause is also not entailed (“If the doctor resigned, the lawyer advised the artist.” ↛ “The lawyer advised the artist.”).

Though BERT and SPINN both achieved notably higher scores on the constituent cases than the other models, these two models differed in the specific subcases they handled best. For example, SPINN obtained 20% accuracy on examples with “or” (e.g., “The doctor danced, or the lawyer resigned.” ↛ “The lawyer resigned.”), while BERT got only 1% accuracy on this case; yet BERT got 22% accuracy on clauses embedded under non-factive verbs (e.g., “The doctor believed that the lawyer resigned.” ↛ “The lawyer resigned.”), while SPINN got only 1% accuracy on this case. These varying results suggest that certain constructions are easier to learn by some architectures than others.

6 Discussion

Apportioning the blame: architecture or training set?

The behavior of a trained model depends on both the training set and the model’s architecture. The models’ poor results on HANS could therefore arise from architectural limitations, from insufficient signal in the MNLI training set, or from both.

The fact that SPINN did markedly better at the constituent and subsequence cases than ESIM and DA, even though the three models were trained on the exact same dataset, suggests that MNLI does contain some signal relevant to HANS; SPINN’s structural inductive biases allow it to leverage this signal, but the other models miss it.

At the same time, other recent works provide reasons to think that the models’ poor performance is due in large part to insufficient signal from the MNLI training set, rather than the models’ representational capacities alone. First, the same pretrained BERT model we used (namely, the bert-base-uncased model) was found by Goldberg (2019) to achieve strong results on syntactic evaluations, suggesting that this pretrained model does have access to relevant syntactic information. For example, BERT performs very well at evaluations based on subject-verb agreement, a phenomenon that requires a distinction to be made between the subject and direct object of a sentence (Linzen et al., 2016; Gulordava et al., 2018; Marvin and Linzen, 2018); it is unlikely that our fine-tuning step on MNLI substantially changed the model’s representational capabilities. Despite this fact, BERT’s accuracy at the subject-object swap cases was 0%, suggesting that even though it has access to information about subjects and objects, the MNLI task did not make it clear enough how that information applies to inference. In a similar vein, McCoy et al. (2019) found little evidence of compositional structure in the InferSent model (Conneau et al., 2017), which was trained on SNLI, even though other RNNs did learn clear compositional structure when trained on tasks that required such structure. Thus, the results of McCoy et al. (2019) also suggest that the training set rather than the architecture might be to blame for poor compositional understanding.

Finally, our BERT-based model differed from the other three models in that it was pretrained on a massive amount of data on a masking task and a next-sentence classification task, followed by fine-tuning on MNLI, while the other models were only trained on MNLI; we therefore cannot rule out the possibility that much of BERT’s comparative success at HANS was due to the greater amount of data it has been exposed to rather than any of its architectural features.

Is the dataset too difficult?

It is possible that the models’ poor performance arose because HANS is simply unfairly hard. However, we believe that most of the cases included in HANS are ones that humans generally have no difficulty with. A possible exception is a subset of the subsequence cases, about which humans have been shown to incorrectly answer comprehension questions. For example, in an experiment in which participants read the sentence “As Jerry played the violin gathered dust in the attic”, some participants answered “yes” to the question “Did Jerry play the violin?” (Christianson et al., 2001). However, given that all of our examples are either readily solved by humans, or are understandable after only brief inspection, it seems reasonable to expect NLP models to be able to handle them.

Is the poor performance due to domain mismatch?

HANS consists of constructed sentences. To minimize the possibility of domain mismatch stemming from limited exposure to particular words in the test set, we were careful to only choose lexical items and verb frames that were well-represented in the training set. Moreover, MNLI was designed specifically with the goal of being multi-genre in the hope that it would lead to cross-domain learning; indeed, performance of trained models degrades only slightly when tested across domains (Williams et al., 2018b).

7 Related Work

7.1 Analyzing trained models

This project relates to an extensive body of research focused on exposing and understanding weaknesses in models’ learned behavior and representations. In the NLI literature, Poliak et al. (2018b) and Gururangan et al. (2018) show that, due to biases in NLI datasets, it is possible to achieve reasonable success at NLI by only looking at the hypothesis. Several other recent works study possible ways in which NLI models might use fallible heuristics rather than valid reasoning to solve the NLI task. Some of these works focus on specific semantic phenomena, such as lexical inferences (Glockner et al., 2018) and quantifiers (Geiger et al., 2018), or biases based on specific words (Sanchez et al., 2018). Our work focuses instead on how models might be led astray by an improper understanding of structure, an aim that extends the proof-of-concept work done by Dasgupta et al. (2018). Our focus on using NLI to address how models understand structure follows some older work about using NLI for the evaluation of parsers (Rimell and Clark, 2010; Mehdad et al., 2010).

Outside the NLI probing literature, multiple projects have used classification tasks to understand what linguistic and/or structural information is captured by vector encodings (e.g. Adi et al., 2017; Ettinger et al., 2018; Conneau et al., 2018). We instead choose the behavioral approach of using NLI examples to test models for two reasons. First, this approach, unlike the classification approach, is agnostic to the type of model being used; our dataset could be used to evaluate a symbolic NLI system just as easily as a neural one, whereas classification tasks only work for models that use vector representations. This flexibility has motivated the use of NLI as a way to reframe a diverse array of other linguistic tasks (Poliak et al., 2018a; White et al., 2017). Second, the fact that a particular piece of information can be decoded from an intermediate vector representation does not necessarily indicate that this information is used by the model; behavioral measures address this problem.

7.2 Heuristics in NLI systems

Similar in spirit to our lexical overlap heuristic, Dasgupta et al. (2018), Nie et al. (2018), and Kim et al. (2018) also studied the performance of NLI models on specific phenomena where word order matters; however, we use a larger set of phenomena with the aim of studying lexical overlap in general rather than within specific phenomena. Naik et al. (2018) also find evidence that NLI models use a lexical overlap heuristic, but our approach is substantially different from theirs. (Naik et al. (2018) diagnose the lexical overlap heuristic by appending “and true is true” to existing MNLI hypotheses, which decreases the lexical overlap with the premise but does not change the label for the sentence pair. We instead generate new sentence pairs for which the words in the hypothesis all appear in the premise.)

These past works all analyze compositionality by studying cases that are effectively subcases of our lexical overlap heuristic. The HANS dataset generalizes this approach by including the other two heuristics. We add these heuristics in recognition of the fact that compositional understanding does not just require keeping track of word order, but also recognizing where syntactic boundaries occur (the failure of which is studied by our subsequence heuristic) and recognizing that, even with a perfect syntactic parse of the sentence, it is insufficient to simply rely on the parse regardless of the words in it (the failure of which is studied by the constituent heuristic).

This work builds on our pilot study in McCoy and Linzen (2019), which studied one of the ten subcases of the subsequence heuristic. Several of our subcases in the subsequence heuristic are inspired by psycholinguistics research on garden path sentences (Bever, 1970; Frazier and Rayner, 1982) and local coherence (Tabor et al., 2004); these works have aims similar to those of our subsequence heuristic but are concerned with the representations used by humans in online sentence processing rather than the representations used by neural networks.

Finally, all of our constituent heuristic subcases depend on understanding the implicational behavior of specific words. Several past works (Pavlick and Callison-Burch, 2016; Rudinger et al., 2018; White et al., 2018; White and Rawlins, 2018) have studied such behavior for verbs (e.g., He knows it is raining entails It is raining, while He believes it is raining does not). We extend that approach by including several other types of words with specific implicational behavior, namely conjunctions (and, or), prepositions that take clausal arguments (if, because), and adverbs (definitely, supposedly). MacCartney and Manning (2009) also discuss the implicational behavior of these various types of words within NLI.

7.3 Generalization

Our work suggests that test sets drawn from the same distribution as the training set may not be adequate for determining whether the model has learned the task it was intended to learn. Instead, it is also necessary to evaluate on a generalization set that departs from the training distribution. McCoy et al. (2018) reached a similar conclusion for the task of forming a question from a declarative sentence; different architectures that all succeeded on the test set failed on the generalization set in different ways, indicating that the test set alone was not sufficient to determine what the models had learned. This same effect can arise not just from using different architectures but also from different random initializations of the same architecture (Weber et al., 2018).

8 Conclusions

Statistical learners learn based on regularities in their training sets. This process makes them vulnerable to adopting heuristics that are valid for frequent cases but make errors on less frequent ones. We have investigated three such heuristics that we hypothesize NLI models are likely to learn. To evaluate whether NLI models do behave consistently with these heuristics, we have introduced the HANS dataset, on which models using these heuristics are guaranteed to fail. We find that four existing NLI models perform very poorly on HANS, suggesting that their high accuracies on NLI test sets may be due to the exploitation of invalid heuristics rather than deeper understanding of natural language. These results indicate that, despite the impressive accuracies of these models on standard evaluations, there is still much progress to be made and that targeted, challenging evaluations, such as HANS, are important for determining whether models are learning what they are intended to be learning.

Acknowledgments

We are grateful to Adam Poliak, Benjamin Van Durme, Samuel Bowman, the members of the JSALT General-Purpose Sentence Representation Learning team, and the members of the Johns Hopkins Computation and Psycholinguistics lab for helpful comments. Any errors remain our own.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. 1746891 and the 2018 Jelinek Summer Workshop on Speech and Language Technology (JSALT). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the JSALT workshop.

References

  • Adi et al. (2017) Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations.
  • Agrawal et al. (2016) Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. 2016. Analyzing the behavior of visual question answering models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1955–1960. Association for Computational Linguistics.
  • Bever (1970) Thomas G. Bever. 1970. The cognitive basis for linguistic structures. Cognition and the development of language, 279(362):1–61.
  • Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  • Bowman et al. (2016) Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1466–1477. Association for Computational Linguistics.
  • Chen et al. (2017) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668. Association for Computational Linguistics.
  • Christianson et al. (2001) Kiel Christianson, Andrew Hollingworth, John F Halliwell, and Fernanda Ferreira. 2001. Thematic roles assigned along the garden path linger. Cognitive psychology, 42(4):368–407.
  • Condoravdi et al. (2003) Cleo Condoravdi, Dick Crouch, Valeria de Paiva, Reinhard Stolle, and Daniel G. Bobrow. 2003. Entailment, intensionality and text understanding. In Proceedings of the HLT-NAACL 2003 Workshop on Text Meaning.
  • Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680. Association for Computational Linguistics.
  • Conneau et al. (2018) Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. Association for Computational Linguistics.
  • Dagan et al. (2006) Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL Recognising Textual Entailment Challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment, MLCW’05, pages 177–190, Berlin, Heidelberg. Springer-Verlag.
  • Dasgupta et al. (2018) Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel J Gershman, and Noah D Goodman. 2018. Evaluating compositionality in sentence embeddings. arXiv preprint arXiv:1802.04302.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
  • Ettinger et al. (2018) Allyson Ettinger, Ahmed Elgohary, Colin Phillips, and Philip Resnik. 2018. Assessing composition in sentence vector representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1790–1801. Association for Computational Linguistics.
  • Frazier and Rayner (1982) Lyn Frazier and Keith Rayner. 1982. Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14(2):178–210.
  • Gardner et al. (2017) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A Deep Semantic Natural Language Processing Platform.
  • Geiger et al. (2018) Atticus Geiger, Ignacio Cases, Lauri Karttunen, and Christopher Potts. 2018. Stress-testing neural models of natural language inference with multiply-quantified sentences. arXiv preprint arXiv:1810.13033.
  • Glockner et al. (2018) Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI Systems with Sentences that Require Simple Lexical Inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655. Association for Computational Linguistics.
  • Goldberg (2019) Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287.
  • Gulordava et al. (2018) Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205. Association for Computational Linguistics.
  • Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112. Association for Computational Linguistics.
  • Kim et al. (2018) Juho Kim, Christopher Malon, and Asim Kadav. 2018. Teaching syntax by adversarial distraction. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 79–84. Association for Computational Linguistics.
  • Klein and Manning (2003) Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.
  • Linzen et al. (2016) Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
  • MacCartney and Manning (2009) Bill MacCartney and Christopher D Manning. 2009. Natural language inference. Stanford University.
  • Marvin and Linzen (2018) Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202. Association for Computational Linguistics.
  • McCoy et al. (2018) R. Thomas McCoy, Robert Frank, and Tal Linzen. 2018. Revisiting the poverty of the stimulus: Hierarchical generalization without a hierarchical bias in recurrent neural networks. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 2093–2098, Austin, TX.
  • McCoy and Linzen (2019) R. Thomas McCoy and Tal Linzen. 2019. Non-entailed subsequences as a challenge for natural language inference. In Proceedings of the Society for Computation in Linguistics, volume 2.
  • McCoy et al. (2019) R. Thomas McCoy, Tal Linzen, Ewan Dunbar, and Paul Smolensky. 2019. RNNs implicitly implement tensor-product representations. In International Conference on Learning Representations.
  • Mehdad et al. (2010) Yashar Mehdad, Alessandro Moschitti, and Fabio Massimo Zanzotto. 2010. Syntactic/semantic structures for textual entailment recognition. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 1020–1028. Association for Computational Linguistics.
  • Naik et al. (2018) Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353. Association for Computational Linguistics.
  • Nie et al. (2018) Yixin Nie, Yicheng Wang, and Mohit Bansal. 2018. Analyzing compositionality-sensitivity of NLI models. arXiv preprint arXiv:1811.07033.
  • Parikh et al. (2016) Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255. Association for Computational Linguistics.
  • Pavlick and Callison-Burch (2016) Ellie Pavlick and Chris Callison-Burch. 2016. Tense manages to predict implicative behavior in verbs. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2225–2229. Association for Computational Linguistics.
  • Poliak et al. (2018a) Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. 2018a. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 67–81. Association for Computational Linguistics.
  • Poliak et al. (2018b) Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018b. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191. Association for Computational Linguistics.
  • Rimell and Clark (2010) Laura Rimell and Stephen Clark. 2010. Cambridge: Parser evaluation using textual entailment by grammatical relation comparison. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 268–271. Association for Computational Linguistics.
  • Rudinger et al. (2018) Rachel Rudinger, Aaron Steven White, and Benjamin Van Durme. 2018. Neural models of factuality. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 731–744. Association for Computational Linguistics.
  • Sanchez et al. (2018) Ivan Sanchez, Jeff Mitchell, and Sebastian Riedel. 2018. Behavior analysis of NLI models: Uncovering the influence of three factors on robustness. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1975–1985. Association for Computational Linguistics.
  • Tabor et al. (2004) Whitney Tabor, Bruno Galantucci, and Daniel Richardson. 2004. Effects of merely local syntactic coherence on sentence processing. Journal of Memory and Language, 50(4):355–370.
  • Wang et al. (2018) Jianyu Wang, Zhishuai Zhang, Cihang Xie, Yuyin Zhou, Vittal Premachandran, Jun Zhu, Lingxi Xie, and Alan Yuille. 2018. Visual concepts and compositional voting. Annals of Mathematical Sciences and Applications, 3(1):151–188.
  • Weber et al. (2018) Noah Weber, Leena Shekhar, and Niranjan Balasubramanian. 2018. The fine line between linguistic generalization and failure in seq2seq-attention models. In Proceedings of the Workshop on Generalization in the Age of Deep Learning, pages 24–27. Association for Computational Linguistics.
  • White et al. (2017) Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. 2017. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 996–1005. Asian Federation of Natural Language Processing.
  • White and Rawlins (2018) Aaron Steven White and Kyle Rawlins. 2018. The role of veridicality and factivity in clause selection. In Proceedings of the 48th Annual Meeting of the North East Linguistic Society.
  • White et al. (2018) Aaron Steven White, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2018. Lexicosyntactic inference in neural models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4717–4724. Association for Computational Linguistics.
  • Williams et al. (2018a) Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018a. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253–267.
  • Williams et al. (2018b) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018b. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Appendix A Templates

Tables 4, 5, and 6 contain the templates for the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic, respectively.

In some cases, a given template has multiple versions, such as one version where a noun phrase modifier attaches to the subject and another where the modifier attaches to the object. For clarity, we have only listed one version of each template here. The full list of templates can be viewed in the code on GitHub.[6]

[6] https://github.com/tommccoy1/hans

Label | Subcase | Template premise | Template hypothesis | Example premise | Example hypothesis
Entailment | Untangling relative clauses | The N1 who the N2 V1 V2 the N3 | The N2 V1 the N1 | The athlete who the judges admired called the manager. | The judges admired the athlete.
Entailment | Sentences with PPs | The N1 P the N2 V the N3 | The N1 V the N3 | The tourists by the actor recommended the authors. | The tourists recommended the authors.
Entailment | Sentences with relative clauses | The N1 that V2 V1 the N2 | The N1 V1 the N2 | The actors that danced saw the author. | The actors saw the author.
Entailment | Conjunctions | The N1 V the N2 and the N3 | The N1 V the N3 | The secretaries encouraged the scientists and the actors. | The secretaries encouraged the actors.
Entailment | Passives | The N1 were V by the N2 | The N2 V the N1 | The authors were supported by the tourists. | The tourists supported the authors.
Non-entailment | Subject-object swap | The N1 V the N2 | The N2 V the N1 | The senators mentioned the artist. | The artist mentioned the senators.
Non-entailment | Sentences with PPs | The N1 P the N2 V the N3 | The N3 V the N2 | The judge behind the manager saw the doctors. | The doctors saw the manager.
Non-entailment | Sentences with relative clauses | The N1 V1 the N2 who the N3 V2 | The N2 V1 the N3 | The actors advised the manager who the tourists saw. | The manager advised the tourists.
Non-entailment | Conjunctions | The N1 V the N2 and the N3 | The N2 V the N3 | The doctors advised the presidents and the tourists. | The presidents advised the tourists.
Non-entailment | Passives | The N1 were V by the N2 | The N1 V the N2 | The senators were recommended by the managers. | The senators recommended the managers.
Table 4: Templates for the lexical overlap heuristic

Label | Subcase | Template premise | Template hypothesis | Example premise | Example hypothesis
Entailment | Conjunctions | The N1 and the N2 V the N3 | The N2 V the N3 | The actor and the professor mentioned the lawyer. | The professor mentioned the lawyer.
Entailment | Adjectives | Adj N1 V the N2 | N1 V the N2 | Happy professors mentioned the lawyer. | Professors mentioned the lawyer.
Entailment | Understood argument | The N1 V the N2 | The N1 V | The author read the book. | The author read.
Entailment | Relative clause on object | The N1 V1 the N2 that V2 the N3 | The N1 V1 the N2 | The artists avoided the senators that thanked the tourists. | The artists avoided the senators.
Entailment | PP on object | The N1 V the N2 P the N3 | The N1 V the N2 | The authors supported the judges in front of the doctor. | The authors supported the judges.
Non-entailment | NP/S | The N1 V1 the N2 V2 the N3 | The N1 V1 the N2 | The managers heard the secretary encouraged the author. | The managers heard the secretary.
Non-entailment | PP on subject | The N1 P the N2 V | The N2 V | The managers near the scientist resigned. | The scientist resigned.
Non-entailment | Relative clause on subject | The N1 that V1 the N2 V2 the N3 | The N2 V2 the N3 | The secretary that admired the senator saw the actor. | The senator saw the actor.
Non-entailment | MV/RR | The N1 V1 P the N2 V2 | The N1 V1 P the N2 | The senators paid in the office danced. | The senators paid in the office.
Non-entailment | NP/Z | P the N1 V1 the N2 V2 the N3 | The N1 V1 the N2 | Before the actors presented the professors advised the manager. | The actors presented the professors.
Table 5: Templates for the subsequence heuristic

Label | Subcase | Template premise | Template hypothesis | Example premise | Example hypothesis
Entailment | Embedded under preposition | P the N1 V1, the N2 V2 the N3 | The N1 V1 | Because the banker ran, the doctors saw the professors. | The banker ran.
Entailment | Outside embedded clause | P the N1 V1 the N2, the N3 V2 the N4 | The N3 V2 the N4 | Although the secretaries recommended the managers, the judges supported the scientist. | The judges supported the scientist.
Entailment | Embedded under verb | The N1 V1 that the N2 V2 | The N2 V2 | The president remembered that the actors performed. | The actors performed.
Entailment | Conjunction | The N1 V1, and the N2 V2 the N3 | The N2 V2 the N3 | The lawyer danced, and the judge supported the doctors. | The judge supported the doctors.
Entailment | Adverbs | Adv the N V | The N V | Certainly the lawyers resigned. | The lawyers resigned.
Non-entailment | Embedded under preposition | P the N1 V1, the N2 V2 the N3 | The N1 V1 | Unless the senators ran, the professors recommended the doctor. | The senators ran.
Non-entailment | Outside embedded clause | P the N1 V1 the N2, the N3 V2 the N4 | The N3 V2 the N4 | Unless the authors saw the students, the doctors helped the bankers. | The doctors helped the bankers.
Non-entailment | Embedded under verb | The N1 V1 that the N2 V2 the N3 | The N2 V2 the N3 | The tourists said that the lawyer saw the banker. | The lawyer saw the banker.
Non-entailment | Disjunction | The N1 V1, or the N2 V2 the N3 | The N2 V2 the N3 | The judges resigned, or the athletes mentioned the author. | The athletes mentioned the author.
Non-entailment | Adverbs | Adv the N1 V the N2 | The N1 V the N2 | Probably the artists saw the authors. | The artists saw the authors.
Table 6: Templates for the constituent heuristic

Appendix B Fine-grained results

Table 7 shows the results by subcase for the subcases where the correct answer is entailment. Table 8 shows the results by subcase for the subcases where the correct answer is non-entailment.

Heuristic | Subcase | DA | ESIM | SPINN | BERT
Lexical overlap | Untangling relative clauses | 0.97 | 0.95 | 0.88 | 0.98
Lexical overlap | Sentences with PPs | 1.00 | 1.00 | 1.00 | 1.00
Lexical overlap | Sentences with relative clauses | 0.98 | 0.97 | 0.97 | 0.99
Lexical overlap | Conjunctions | 1.00 | 1.00 | 1.00 | 0.77
Lexical overlap | Passives | 1.00 | 1.00 | 0.95 | 1.00
Subsequence | Conjunctions | 1.00 | 1.00 | 1.00 | 0.98
Subsequence | Adjectives | 1.00 | 1.00 | 1.00 | 1.00
Subsequence | Understood argument | 1.00 | 1.00 | 0.84 | 1.00
Subsequence | Relative clause on object | 0.98 | 0.99 | 0.95 | 0.99
Subsequence | PP on object | 1.00 | 1.00 | 1.00 | 1.00
Constituent | Embedded under preposition | 0.99 | 0.99 | 0.85 | 1.00
Constituent | Outside embedded clause | 0.94 | 1.00 | 0.95 | 1.00
Constituent | Embedded under verb | 0.92 | 0.94 | 0.99 | 0.99
Constituent | Conjunction | 0.99 | 1.00 | 0.89 | 1.00
Constituent | Adverbs | 1.00 | 1.00 | 0.98 | 1.00
Table 7: Results for the subcases where the correct label is entailment.

Heuristic | Subcase | DA | ESIM | SPINN | BERT
Lexical overlap | Subject-object swap | 0.00 | 0.00 | 0.03 | 0.00
Lexical overlap | Sentences with PPs | 0.00 | 0.00 | 0.01 | 0.25
Lexical overlap | Sentences with relative clauses | 0.04 | 0.04 | 0.06 | 0.18
Lexical overlap | Conjunctions | 0.00 | 0.00 | 0.01 | 0.39
Lexical overlap | Passives | 0.00 | 0.00 | 0.00 | 0.00
Subsequence | NP/S | 0.04 | 0.02 | 0.09 | 0.02
Subsequence | PP on subject | 0.00 | 0.00 | 0.00 | 0.06
Subsequence | Relative clause on subject | 0.03 | 0.04 | 0.05 | 0.01
Subsequence | MV/RR | 0.04 | 0.03 | 0.03 | 0.00
Subsequence | NP/Z | 0.02 | 0.01 | 0.11 | 0.10
Constituent | Embedded under preposition | 0.14 | 0.02 | 0.29 | 0.50
Constituent | Outside embedded clause | 0.01 | 0.00 | 0.02 | 0.00
Constituent | Embedded under verb | 0.00 | 0.00 | 0.01 | 0.22
Constituent | Disjunction | 0.01 | 0.03 | 0.20 | 0.01
Constituent | Adverbs | 0.00 | 0.00 | 0.00 | 0.08
Table 8: Results for the subcases where the correct label is non-entailment.
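
The per-subcase accuracies can be summarized per heuristic by averaging. The sketch below does this for the BERT column of Table 8; it is an illustrative calculation, not part of the original analysis, and the unweighted mean matches per-heuristic accuracy only under the assumption that every subcase contributes the same number of examples. The variable names are hypothetical.

```python
from statistics import mean

# BERT accuracies on non-entailment subcases, copied from Table 8.
BERT_NON_ENTAILMENT = {
    "Lexical overlap": [0.00, 0.25, 0.18, 0.39, 0.00],
    "Subsequence": [0.02, 0.06, 0.01, 0.00, 0.10],
    "Constituent": [0.50, 0.00, 0.22, 0.01, 0.08],
}

# Unweighted mean over subcases (assumes equally sized subcases).
per_heuristic = {name: mean(accs) for name, accs in BERT_NON_ENTAILMENT.items()}

for name, acc in per_heuristic.items():
    print(f"{name}: {acc:.2f}")
```

Even for the strongest model in the table, all three averages fall well below 0.5, consistent with the conclusion that the models behave as the heuristics predict on non-entailment cases.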