Stress Test Evaluation for Natural Language Inference

Abstract

Natural language inference (NLI) is the task of determining if a natural language hypothesis can be inferred from a given premise in a justifiable manner. NLI was proposed as a benchmark task for natural language understanding. Existing models perform well on standard datasets for NLI, achieving impressive results across different genres of text. However, the extent to which these models understand the semantic content of sentences is unclear. In this work, we propose an evaluation methodology consisting of automatically constructed "stress tests" that allow us to examine whether systems have the ability to make real inferential decisions. Our evaluation of six sentence-encoder models on these stress tests reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena, and suggests important directions for future work in this area.
