Commonsense Reasoning for Natural Language Understanding: A Survey of Benchmarks, Resources, and Approaches

  • 2019-04-02 02:09:01
  • Shane Storks, Qiaozi Gao, Joyce Y. Chai


Commonsense knowledge and commonsense reasoning are some of the main bottlenecks in machine intelligence. In the NLP community, many benchmark datasets and tasks have been created to address commonsense reasoning for language understanding. These tasks are designed to assess machines' ability to acquire and learn commonsense knowledge in order to reason and understand natural language text. As these tasks become instrumental and a driving force for commonsense research, this paper aims to provide an overview of existing tasks and benchmarks, knowledge resources, and learning and inference approaches toward commonsense reasoning for natural language understanding. Through this, our goal is to support a better understanding of the state of the art, its limitations, and future challenges.



Shane Storks
Qiaozi Gao
Joyce Y. Chai
Department of Computer Science and Engineering
Michigan State University
East Lansing, MI 48824 USA


1 Introduction

Commonsense knowledge and commonsense reasoning play a vital role in all aspects of machine intelligence, from language understanding to computer vision and robotics. A detailed account of challenges with commonsense reasoning is provided by ? (?), ranging from difficulties in understanding and formulating commonsense knowledge for specific or general domains to complexities in various forms of reasoning and their integration for problem solving. To move the field forward, as pointed out by ? (?), there is a critical need for methods that can integrate different modes of reasoning (e.g., symbolic reasoning through deduction and statistical reasoning based on large amounts of data), as well as benchmarks and evaluation metrics that can quantitatively measure research progress.

In the NLP community, recent years have seen a surge of research activities that aim to tackle commonsense reasoning through an ever-growing number of benchmark tasks. These range from earlier textual entailment tasks, e.g., the RTE Challenges (?), to more recent tasks that require a comprehensive understanding of everyday physical and social commonsense, e.g., the Story Cloze Test (?) or SWAG (?). An increasing effort has been devoted to extracting commonsense knowledge from existing data (e.g., Wikipedia) or acquiring it directly from crowd workers. Many learning and inference approaches have been developed for these benchmark tasks, ranging from earlier symbolic and statistical approaches to more recent neural approaches. To facilitate quantitative evaluation and encourage broader participation from the community, various leaderboards for these benchmarks have been set up and maintained.

As more and more resources become available for commonsense reasoning for NLP, it is useful for researchers to have a quick grasp of this fast-evolving space. As such, this paper aims to provide an overview of existing benchmarks, knowledge resources, and approaches, and to discuss current limitations and future opportunities. We hope this paper will provide an entry point for those who are not familiar with this topic area but are interested in pursuing research in it.

Figure 1: Main research efforts in commonsense knowledge and reasoning from the NLP community occur in three areas: benchmarks and tasks, knowledge resources, and learning and inference approaches. (Footnote 1: Although videos and images are often used for creating various benchmarks, they are not the focus of this paper.)

As shown in Figure 1, ongoing research effort has focused on three main areas. The first, discussed further in Section 2, is the creation of benchmark datasets for various NLP tasks. Different from earlier benchmark tasks that mainly focus on linguistic processing, these tasks are designed so that the solutions may not be obvious from the linguistic context but rather require commonsense knowledge and reasoning. These benchmarks vary in the scope of reasoning, from the specific and focused coreference resolution task in the Winograd Schema Challenge (?) to broader tasks such as textual entailment, e.g., the RTE Challenges (?). While some benchmarks focus on only one specific task, others comprise a variety of tasks, e.g., GLUE (?). Some benchmarks target a specific type of commonsense knowledge, e.g., social psychology in Event2Mind (?), while others intend to address a variety of types of knowledge, e.g., bAbI (?). How benchmark tasks are formulated also differs among existing benchmarks. Some are in the form of multiple-choice questions, requiring a binary decision or a selection from a candidate list, while others are more open-ended. Different characteristics of benchmarks serve different goals, and a critical question is what criteria to consider in order to create a benchmark that can support technology development and measurable research progress on commonsense reasoning abilities. Section 2 gives a detailed account of existing benchmarks and their common characteristics. It also summarizes important criteria to consider in building these benchmarks.

A parallel ongoing research effort is the population of commonsense knowledge bases. Apart from earlier efforts where knowledge bases are manually created by domain experts, e.g., Cyc (?) or WordNet (?), or by crowdsourcing, e.g., ConceptNet (?), recent work has focused on applying NLP techniques to automatically extract information (e.g., facts and relations) and build knowledge representations such as knowledge graphs. Section 3 gives an introduction to existing commonsense knowledge resources and several recent efforts in building such resources.

To tackle benchmark tasks, many computational models have been developed, from earlier symbolic and statistical approaches to more recent approaches that learn deep neural models from training data. These models are often augmented with external data or knowledge resources, or pre-trained word embeddings, e.g., BERT (?). Section 4 summarizes the state-of-the-art approaches and discusses their performance and limitations.

It is worth pointing out that, as commonsense reasoning plays an important role in almost every aspect of machine intelligence, recent years have also seen an increasing effort in other related disciplines such as computer vision, robotics, and the intersection between language, vision, and robotics. This paper focuses only on developments in the NLP field and does not cover these other related areas. Nevertheless, the current practice, progress, and limitations of the NLP field may also provide some insight to these related disciplines, and vice versa.

2 Benchmarks and Tasks

The NLP community has a long history of creating benchmarks to facilitate algorithm development and evaluation for language processing tasks, e.g., named entity recognition, coreference resolution, and question answering. Although some type of commonsense reasoning is often required to reach oracle performance on these tasks, earlier benchmarks have mainly targeted approaches that apply linguistic context to solve them. As significant progress has been made on the earlier benchmarks, recent years have seen a shift toward benchmark tasks that go beyond the use of linguistic context and instead require commonsense knowledge and reasoning. For instance, consider this question from the Winograd Schema Challenge (?): "The trophy would not fit in the brown suitcase because it was too big. What was too big?" Linguistic constraints alone cannot resolve whether it refers to the trophy or the brown suitcase. Only with commonsense knowledge (i.e., an object must be bigger than another object in order to contain it) is it possible to resolve the pronoun it and answer the question correctly. It is the goal of this paper to survey these kinds of recent benchmarks, which are geared toward the use of commonsense reasoning for NLP tasks beyond the use of linguistic discourse.

In this section, we first give an overview of recent benchmarks that address commonsense reasoning in NLP. We then summarize important considerations in creating these benchmarks and lessons learned from ongoing research.

2.1 An Overview of Existing Benchmarks

Since the Recognizing Textual Entailment (RTE) Challenges were introduced by ? (?) in the early 2000s, there has been an explosion of commonsense-directed benchmarks. Figure 2 shows this growth among the benchmarks introduced in this section. The RTE Challenges, which were inspired by difficulties encountered in semantic processing for machine translation, question answering, and information extraction (?), were the dominant reasoning task for many years. Since 2013, however, there has been a surge of a variety of benchmarks. A majority of them (e.g., those released in 2018) provide more than 10,000 instances, aiming to facilitate the development of machine learning approaches.

Figure 2: Since the early 2000s, there has been a surge of benchmark tasks geared towards commonsense reasoning for language understanding. In 2018, we saw the creation of more benchmarks of larger sizes than ever before.

Many commonsense benchmarks are based upon classic language processing problems. The scope of these benchmark tasks ranges from more focused tasks, such as coreference resolution and named entity recognition, to more comprehensive tasks and applications, such as question answering and textual entailment. More focused tasks tend to be useful in creating component technology for NLP systems, building upon each other toward more comprehensive tasks.

Meanwhile, rather than restricting tasks by the types of language processing skills required to perform them, a common characteristic of earlier benchmarks, recent benchmarks are more commonly geared toward particular types of commonsense knowledge and reasoning. Some benchmark tasks focus on singular commonsense reasoning processes, e.g., temporal reasoning, requiring a small amount of commonsense knowledge, while others focus on entire domains of knowledge, e.g., social psychology, thus requiring a larger set of related reasoning skills. Further, some benchmarks include a more comprehensive mixture of everyday commonsense knowledge, demanding a more complete commonsense reasoning skill set.

Next, we review widely used benchmarks, organized into the following groupings: coreference resolution, question answering, textual entailment, plausible inference, psychological reasoning, and multiple tasks. These groupings are not necessarily exclusive, but they highlight the recent trends in commonsense benchmarks. While all tasks' training and/or development data are available for free download, it is important to note that test data are often not distributed publicly so that testing of systems can occur privately in a standard, unbiased way.

2.1.1 Coreference Resolution

One particularly fundamental focused task is coreference resolution, i.e., determining which entity or event in a text a particular pronoun refers to. This task becomes challenging when a sentence contains multiple candidate entities, since the resulting ambiguity creates a need for commonsense knowledge to inform decisions (?). Even when informed by a corpus or knowledge graph, current coreference resolution systems are plagued by data bias, e.g., gender bias in nouns for occupations (?). Since challenging examples in commonsense-informed coreference resolution are typically handcrafted or handpicked (?, ?, ?), only a small amount of data exists for this skill, and it remains an unsolved problem (?, ?). Consequently, there is a need for more benchmarks here. In the following paragraphs, we introduce the few benchmarks and tasks that currently exist for coreference resolution.

Winograd Schema Challenge.

The classic coreference resolution benchmark is the Winograd Schema Challenge (WSC), inspired by ? (?), originally proposed by ? (?), later developed by ? (?), and actualized by ? (?) and ? (?). In this challenge, systems are presented with questions about sentences known as Winograd schemas. To answer a question, a system must disambiguate a pronoun whose coreferent may be one of two entities, and can be changed by replacing a single word in the sentence. The first Challenge dataset consisted of just 60 testing examples (?), but there are more available online.
Additionally, ? (?) curated nearly 1,000 similar pronoun resolution problems to aid in training systems for the WSC; this set of problems is included within a later-mentioned dataset, and thus is not introduced separately. Further information about the Winograd Schema Challenge is available at
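
The structure of a Winograd schema can be sketched as a small record: a sentence template with a slot for the special word, two candidate referents, and the correct referent for each word choice. This is a minimal illustration only; the class and field names are hypothetical and do not correspond to any official WSC file format.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple


@dataclass(frozen=True)
class WinogradSchema:
    """One schema: swapping the special word flips the pronoun's referent."""
    template: str                # sentence with a {word} slot (hypothetical format)
    pronoun: str                 # the ambiguous pronoun
    candidates: Tuple[str, str]  # the two possible referents
    answers: Dict[str, str]      # special word -> correct referent

    def instances(self) -> Iterator[Tuple[str, str]]:
        """Yield (sentence, correct referent) for each special word."""
        for word, referent in self.answers.items():
            yield self.template.format(word=word), referent


# The trophy/suitcase schema quoted earlier in this survey.
trophy = WinogradSchema(
    template="The trophy would not fit in the brown suitcase because it was too {word}.",
    pronoun="it",
    candidates=("the trophy", "the suitcase"),
    answers={"big": "the trophy", "small": "the suitcase"},
)

for sentence, referent in trophy.instances():
    print(f"{sentence} -> {referent}")
```

Keeping both word choices in one record makes explicit that a system cannot succeed by memorizing a preference for one candidate, since the same sentence frame yields opposite answers.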

Other coreference resolution tasks.

Other coreference resolution tasks include the basic and compound coreference tasks within bAbI (?), the definite pronoun resolution problems by ? (?) within Inference is Everything (?), the Winograd NLI task within GLUE (?), and the gendered anaphora-based Winogender task originally by ? (?) within DNC (?). The full benchmarks to which these tasks belong are introduced in Section 2.1.6, while in Figure 3, we list several examples of challenging coreference resolution problems across all existing benchmarks, where commonsense knowledge, such as the fact that predators eat their prey, is useful in disambiguating pronouns.

  1. Winograd Schema Challenge (?)
     The trophy would not fit in the brown suitcase because it was too big. What was too big?
       (a) The trophy
       (b) The suitcase
     The trophy would not fit in the brown suitcase because it was too small. What was too small?
       (a) The trophy
       (b) The suitcase

  2. Winogender (?)
     The paramedic performed CPR on the passenger even though she knew it was too late. Who knew it was too late?
       (a) The paramedic
       (b) The passenger

  3. Lions eat zebras because they are predators. Who are predators?
       (a)
       (b)

Figure 3: Examples from existing coreference resolution benchmark tasks. Answers in bold.

2.1.2 Question Answering

Rather than providing one or more focused language processing or reasoning tasks, many benchmarks provide a more comprehensive mix of language processing and reasoning skills within a single task. Question answering (QA) is one such comprehensive task. QA is a fairly well-investigated area of artificial intelligence, and there are many existing benchmarks to support work here; however, not all of them require commonsense to solve (?). Some QA benchmarks which directly address commonsense are MCScript (?), CoQA (?), OpenBookQA (?), and ReCoRD (?), all of which contain questions requiring commonsense knowledge alongside questions requiring comprehension of a given text. Examples of such questions are listed in Figure 4; here, the commonsense knowledge facts that diapers are typically thrown away, and that steel is a metal, are particularly useful in answering questions. Other QA benchmarks indirectly address commonsense by demanding advanced reasoning processes best informed by commonsense. SQuAD 2.0 is one such example, as it includes unanswerable questions about passages (?), which may take outside knowledge to identify. In the following paragraphs, we introduce all surveyed QA benchmarks.

  1. MCScript (?)
     Did they throw away the old diaper?
       (a) Yes, they put it into the bin.
       (b) No, they kept it for a while.

  2. OpenBookQA (?)
     Which of these would let the most heat travel through?
       (a) a new pair of jeans.
       (b) a steel spoon in a cafeteria.
       (c) a cotton candy at a store.
       (d) a calvin klein cotton hat.
     Evidence: Metal is a thermal conductor.

  3. CoQA (?)
     The Virginia governor’s race, billed as the marquee battle of an otherwise anticlimactic 2013 election cycle, is shaping up to be a foregone conclusion. Democrat Terry McAuliffe, the longtime political fixer and moneyman, hasn’t trailed in a poll since May. Barring a political miracle, Republican Ken Cuccinelli will be delivering a concession speech on Tuesday evening in Richmond. In recent …

     Who is the democratic candidate?
     Terry McAuliffe
     Evidence: Democrat Terry McAuliffe

     Who is his opponent?
     Ken Cuccinelli
     Evidence: Republican Ken Cuccinelli

Figure 4: Examples from QA benchmarks which require common and/or commonsense knowledge. Answers in bold.

The AI2 Reasoning Challenge (ARC) from ? (?) provides a dataset of almost 8,000 four-way multiple-choice science questions and answers, as well as a corpus of 14 million science-related sentences which are claimed to contain most of the information needed to answer the questions. As many of the questions require systems to draw information from multiple sentences in the corpus to answer correctly, it is not possible to accurately solve the task by simply searching the corpus for keywords. As such, this task encourages advanced reasoning which may be useful for commonsense reasoning. ARC can be downloaded at


The MCScript benchmark by ? (?) is one of the few QA benchmarks that emphasize commonsense. The dataset consists of about 14,000 2-way multiple-choice questions based on short passages. A large proportion of its questions require pure commonsense knowledge to answer, and thus can be answered without the passage. Questions are conveniently labeled with whether they are answered with the provided text or commonsense. A download link to MCScript can be found at


ProPara by ? (?) consists of 488 annotated paragraphs of procedural text. These paragraphs describe various processes such as photosynthesis and hydroelectric power generation in order for systems to learn object tracking in processes which involve changes of state. The authors assert that recognizing these state changes can require world knowledge, so proficiency in commonsense reasoning is necessary to perform well. Annotations of the paragraphs are in the form of a grid which describes the state of each participant in the paragraph after each sentence of the paragraph. If a system understands a paragraph in this dataset, it is said that for each entity mentioned in the paragraph, it can answer any question about whether the entity is created, destroyed, or moved, and when and where this happens. To answer all such questions accurately, a system must produce a grid identical to the annotations for the paragraph. Thus, systems are evaluated by their performance on this task. ProPara data is linked from
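
The grid annotation described above can be sketched as a table with one column per participant and one entry per sentence, where each entry holds the participant's location after that sentence. The toy process, the `"-"` marker for "does not exist", and the cell-level accuracy function are all assumptions made for illustration; they are not the official ProPara data format or evaluation metric.

```python
from typing import Dict, List

# A grid maps each participant to its state after each sentence.
# "-" is used here (by assumption) to mean the participant does not exist.
Grid = Dict[str, List[str]]

gold: Grid = {
    "water": ["soil", "root", "leaf", "-"],
    "sugar": ["-", "-", "-", "leaf"],
}
predicted: Grid = {
    "water": ["soil", "root", "leaf", "leaf"],
    "sugar": ["-", "-", "-", "leaf"],
}


def cell_accuracy(pred: Grid, ref: Grid) -> float:
    """Fraction of grid cells where the prediction matches the annotation."""
    total = correct = 0
    for participant, ref_states in ref.items():
        pred_states = pred.get(participant, [])
        for step, ref_state in enumerate(ref_states):
            total += 1
            if step < len(pred_states) and pred_states[step] == ref_state:
                correct += 1
    return correct / total if total else 0.0


def changes(states: List[str]) -> List[str]:
    """Label each transition between sentences as create, destroy, move, or none."""
    labels = []
    for before, after in zip(states, states[1:]):
        if before == "-" and after != "-":
            labels.append("create")
        elif before != "-" and after == "-":
            labels.append("destroy")
        elif before != after:
            labels.append("move")
        else:
            labels.append("none")
    return labels


print(cell_accuracy(predicted, gold))
print(changes(gold["water"]))
```

Deriving the created, destroyed, and moved labels from the grid shows why a grid identical to the annotation is sufficient to answer every such question about a paragraph.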


MultiRC by ? (?) is a reading comprehension dataset consisting of about 10,000 questions posed on over 800 paragraphs across a variety of topic domains. It differs from a traditional reading comprehension dataset in that most questions can only be answered by reasoning over multiple sentences in the accompanying paragraphs, answers are not spans of text from the paragraph, and the number of answer choices as well as the number of correct answers for each question is entirely variable. All of these features make it difficult for shallow and artificial approaches to perform well on the benchmark, encouraging a deeper understanding of the passage. Further, the benchmark includes a variety of nontrivial semantic phenomena in passages, such as coreference and causal relationships, which often require commonsense to recognize and parse. MultiRC can be downloaded at


The Stanford Question Answering Dataset (SQuAD) from ? (?) was originally a set of about 100,000 open-ended questions posed on passages from Wikipedia articles, which are provided with the questions. The initial dataset did not require commonsense to solve; questions required little reasoning, and answers were spans of text directly from the passage. To make this dataset more challenging, SQuAD 2.0 (?) was later released to add about 50,000 additional questions which are unanswerable given the passage. Determining whether a question is answerable may require outside knowledge, or at least some more advanced reasoning. All SQuAD data can be downloaded at


The Conversational Question Answering (CoQA) dataset by ? (?) contains passages each accompanied by a set of questions in conversational form, as well as their answers and evidence for the answers. There are about 127,000 questions in the dataset in total, but as they are conversational, questions pertaining to a passage must be answered together in order. While the conversational element of CoQA is not new, e.g., QuAC by ? (?), CoQA is unique in including questions pertaining directly to commonsense reasoning, following ? (?) in addressing a growing need for reading comprehension datasets which require various forms of reasoning. CoQA also includes out-of-domain types of questions appearing only in the test set, and unanswerable questions throughout. CoQA data can be downloaded at


OpenBookQA by ? (?) intends to address the shortcomings of previous QA datasets. Earlier datasets often do not require commonsense or any advanced reasoning to solve, and those that do require vast domains of knowledge which are difficult to capture. OpenBookQA contains about 6,000 4-way multiple-choice science questions which may require science facts or other common and commonsense knowledge. Instead of providing no knowledge resource like MCScript (?), or a prohibitively large corpus of facts to support answering the questions like ARC (?), OpenBookQA provides an "open book" of about 1,300 science facts to support answering the questions, each associated directly with the question(s) they apply to. For required common and commonsense knowledge, the authors expect that outside resources will be used in answering the questions. Information for downloading OpenBookQA can be found at


CommonsenseQA by ? (?) is a QA benchmark which directly targets commonsense, like CoQA (?) and ReCoRD (?), consisting of 9,500 three-way multiple-choice questions. To ensure an emphasis on commonsense, each question requires one to disambiguate a target concept from three connected concepts in ConceptNet, a commonsense knowledge graph (?). Utilizing a large knowledge graph like ConceptNet ensures not only that questions target commonsense relations directly, but that the types of commonsense knowledge and reasoning required by questions are highly varied. CommonsenseQA data can be downloaded at

2.1.3 Textual Entailment

Recognizing textual entailment is another such comprehensive task. Textual entailment is defined by ? (?) as a directional relationship between a text and a hypothesis, where it can be said that the text entails the hypothesis if a typical person would infer that the hypothesis is true given the text. Some tasks expand this by also requiring recognition of contradiction, e.g., the fourth and fifth RTE Challenges (?, ?). Performing such a task requires the utilization of several simpler language processing skills, such as paraphrase, object tracking, and causal reasoning, but since it also requires a sense of what a typical person would infer, commonsense knowledge is often essential to textual entailment tasks. The RTE Challenges (?) are the classic benchmarks for entailment, but there are now several larger benchmarks inspired by them. Examples from these benchmarks are listed in Figure 5, and they require commonsense knowledge such as the process of making snow angels, and the relationship between the presence of a crowd and loneliness. In the following paragraphs, we introduce all textual entailment benchmarks in detail.

  1. RTE Challenge (?)
     Text: American Airlines began laying off hundreds of flight attendants on Tuesday, after a federal judge turned aside a union’s bid to block the job losses.
     Hypothesis: American Airlines will recall hundreds of flight attendants as it steps up the number of flights it operates.
     Label: not entailment

  2. SICK (?) (Footnote 3: Example extracted from the SICK data available at)
     Sentence 1: Two children are lying in the snow and are drawing angels.
     Sentence 2: Two children are lying in the snow and are making snow angels.
     Label: entailment

  3. SNLI (?)
     Text: A black race car starts up in front of a crowd of people.
     Hypothesis: A man is driving down a lonely road.
     Label: contradiction

  4. MultiNLI, Telephone (?)
     Context: that doesn’t seem fair does it
     Hypothesis: There’s no doubt that it’s fair.
     Label: contradiction

  5. SciTail (?)
     Premise: During periods of drought, trees died and prairie plants took over previously forested regions.
     Hypothesis: Because trees add water vapor to air, cutting down forests leads to longer periods of drought.
     Label: neutral

Figure 5: Examples from RTE benchmarks. Answers in bold.

RTE Challenges.

An early attempt at an evaluation scheme for commonsense reasoning was the Recognizing Textual Entailment (RTE) Challenge (?). The inaugural challenge provided a task where, given a text and hypothesis, systems were expected to predict whether the text entailed the hypothesis. In the following years, more similar Challenges took place (?, ?). The fourth and fifth Challenges added a new three-way decision task where systems were additionally expected to recognize contradiction relationships between texts and hypotheses (?, ?). The main task for the sixth and seventh Challenges instead provided one hypothesis and several potential entailing sentences in a corpus (?, ?). The eighth Challenge (?) addressed a somewhat different problem, which focused on classifying student responses as an effort toward providing automatic feedback in an educational setting. The first five RTE Challenge datasets consisted of around 1,000 examples each (?, ?, ?, ?, ?), while the sixth and seventh consisted of about 33,000 and 49,000 examples respectively. Data from all RTE Challenges can be downloaded at


The Sentences Involving Compositional Knowledge (SICK) benchmark by ? (?) is a collection of close to 10,000 pairs of sentences. Two tasks are presented with this dataset, one for sentence relatedness, and one for entailment. The entailment task, which is more related to our survey, is a 3-way decision task in the style of RTE-4 (?) and RTE-5 (?). SICK can be downloaded at


The Stanford Natural Language Inference (SNLI) benchmark from ? (?) contains nearly 600,000 sentence pairs, and also provides a 3-way decision task similar to the fourth and fifth RTE Challenges (?, ?). In addition to gold labels for entailment, contradiction, or neutral, the SNLI data includes five crowd judgments for the label, which may indicate a level of confidence or agreement for it. This benchmark was later expanded into MultiNLI (?), which follows the same format, but includes sentences of various genres, such as telephone conversations. MultiNLI is included within the previously introduced GLUE benchmark (?), while SNLI can be downloaded at
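
How five crowd judgments might be turned into a label and an agreement level can be sketched as follows. The strict-majority rule and the function name `summarize_judgments` are assumptions for illustration; the SNLI release itself should be consulted for the exact procedure used to assign gold labels.

```python
from collections import Counter
from typing import List, Optional, Tuple

LABELS = ("entailment", "contradiction", "neutral")


def summarize_judgments(judgments: List[str]) -> Tuple[Optional[str], float]:
    """Return (majority label or None, agreement ratio) for one sentence pair."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    unknown = set(judgments) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    label, votes = Counter(judgments).most_common(1)[0]
    agreement = votes / len(judgments)
    # Require a strict majority; otherwise report that there is no consensus.
    majority = label if votes > len(judgments) / 2 else None
    return majority, agreement


clear = ["contradiction"] * 4 + ["neutral"]
split = ["entailment", "entailment", "neutral", "neutral", "contradiction"]

print(summarize_judgments(clear))
print(summarize_judgments(split))
```

The agreement ratio is one simple way to read the individual judgments as a confidence signal, for example to filter or down-weight pairs on which annotators disagreed.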


SciTail by ? (?) consists of about 27,000 premise-hypothesis sentence pairs adapted from science questions into a 2-way entailment task similar to the first RTE Challenge (?). Unlike other entailment tasks, this one is primarily science-based, which may require some knowledge more advanced than everyday commonsense. SciTail can be downloaded at

2.1.4 Plausible Inference

While textual entailment benchmarks require one to draw concrete conclusions, others require hypothetical, intermediate, or uncertain conclusions, defined as plausible inference (?). Such benchmarks often focus on everyday events, which contain a wide variety of practical commonsense relations. Examples from these benchmarks are listed in Figure 6, and they require commonsense knowledge of everyday interactions, e.g., answering a door, and activities, e.g., cooking. In the following paragraphs, we introduce all plausible inference commonsense benchmarks.

  1. COPA (?)
     I knocked on my neighbor’s door. What happened as a result?
       (a) My neighbor invited me in.
       (b) My neighbor left his house.

  2. ROCStories (?)
     Tom and Sheryl have been together for two years. One day, they went to a carnival together. He won her several stuffed bears, and bought her funnel cakes. When they reached the Ferris wheel, he got down on one knee.
       (a) Tom asked Sheryl to marry him.
       (b) He wiped mud off of his boot.

  3. SWAG (?)
     He pours the raw egg batter into the pan. He
       (a) drops the tiny pan onto a plate
       (b) lifts the pan and moves it around to shuffle the eggs.
       (c) stirs the dough into a kite.
       (d) swirls the stir under the adhesive.

  4. JOCI (?)
     Context: John was excited to go to the fair
     Hypothesis: The fair opens.
     Label: 5 (very likely)

     Context: Today my water heater broke
     Hypothesis: A person looks for a heater.
     Label: 4 (likely)

     Context: John’s goal was to learn how to draw well
     Hypothesis: A person accomplishes the goal.
     Label: 3 (plausible)

     Context: Kelly was playing a soccer match for her University
     Hypothesis: The University is dismantled.
     Label: 2 (technically possible)

     Context: A brown-haired lady dressed all in blue denim sits in a group of pigeons.
     Hypothesis: People are made of the denim.
     Label: 1 (impossible)

Figure 6: Examples from commonsense benchmarks requiring plausible inference. Answers in bold.

The Choice of Plausible Alternatives (COPA) task by ? (?) provides premises, each with two possible alternatives for the cause or effect of the premise. Examples require both forward and backward causal reasoning, meaning that the premise could either be a cause or effect of the correct alternative. COPA data, which consists of 1,000 examples total, can be downloaded at
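
The difference between forward and backward causal reasoning can be sketched in code: depending on whether an item asks for an effect or a cause, the premise takes the role of the cause or of the effect when a candidate pair is scored. The `CopaItem` record, the `choose` function, and the hand-written scores are hypothetical stand-ins for illustration, not the COPA data format or any published model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class CopaItem:
    premise: str
    asks_for: str                  # "cause" or "effect"
    alternatives: Tuple[str, str]


def choose(item: CopaItem, plausibility: Callable[[str, str], float]) -> int:
    """Return the index of the more plausible alternative.

    `plausibility(cause, effect)` is any scorer supplied by the caller.
    For "effect" items the premise is the cause (forward reasoning);
    for "cause" items the premise is the effect (backward reasoning).
    """
    scores = []
    for alternative in item.alternatives:
        if item.asks_for == "effect":
            scores.append(plausibility(item.premise, alternative))
        else:
            scores.append(plausibility(alternative, item.premise))
    return 0 if scores[0] >= scores[1] else 1


# Example from Figure 6, which asks for an effect of the premise.
item = CopaItem(
    premise="I knocked on my neighbor's door.",
    asks_for="effect",
    alternatives=("My neighbor invited me in.", "My neighbor left his house."),
)

# Hand-written scores standing in for a real model (an assumption for this demo).
toy_scores = {
    (item.premise, item.alternatives[0]): 0.9,
    (item.premise, item.alternatives[1]): 0.2,
}
best = choose(item, lambda cause, effect: toy_scores.get((cause, effect), 0.0))
print(item.alternatives[best])
```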


The Children’s Book Test (CBT) from ? (?) consists of about 687,000 cloze-style questions from 20-sentence passages mined from publicly available children’s books. These questions require a system to fill in a blank in a line of a story passage given a set of 10 candidate words to fill the blank. Questions are classified into 4 tasks based on the type of the missing word to predict, which can be a named entity, common noun, verb, or preposition. CBT can be downloaded at


ROCStories by ? (?) is a corpus of about 50,000 five-sentence everyday life stories, containing a host of causal and temporal relationships between events, ideal for learning commonsense knowledge rules. Of these 50,000 stories, about 3,700 are designated as test cases, which include a plausible and implausible alternate story ending for trained systems to choose between. The task of solving the ROCStories test cases is called the Story Cloze Test, a more challenging alternative to the narrative cloze task proposed by ? (?). The most recent release of ROCStories, which can be requested at, adds about 50,000 stories to the dataset.


The JHU Ordinal Commonsense Inference (JOCI) benchmark by ? (?) consists of about 39,000 sentence pairs, each consisting of a context and hypothesis. Given these, systems must rate how likely the hypothesis is on a scale from 1 to 5, where 1 corresponds to impossible, 2 to technically possible, 3 to plausible, 4 to likely, and 5 to very likely. This task is similar to SNLI (?) and other 3-way entailment tasks, but provides more options which essentially range between entailment and contradiction. This is fitting, considering the fuzzier nature of plausible inference tasks compared to textual entailment tasks. JOCI can be downloaded from


The Cloze Test by Teachers (CLOTH) benchmark by ? (?) is a collection of nearly 100,000 4-way multiple-choice cloze-style questions from middle- and high school-level English language exams, where the answer fills in a blank in a given text. Each question is labeled with the type of reasoning it involves, where the four possible types are grammar, short-term reasoning, matching/paraphrasing, and long-term reasoning. CLOTH data can be downloaded at


Situations with Adversarial Generations (SWAG) from ? (?) is a benchmark dataset of about 113,000 beginnings of small texts each with four possible endings. Given the context each text provides, systems decide which of the four endings is most plausible. Examples also include labels for the source of the correct ending, and ordinal labels for the likelihood of each possible ending and the correct ending. SWAG data can be downloaded at


The Reading Comprehension with Commonsense Reasoning (ReCoRD) benchmark by ? (?), similar to SQuAD (?), consists of questions posed on passages, particularly news articles. However, questions in ReCoRD are in cloze format, requiring more hypothetical reasoning, and many questions explicitly require commonsense reasoning to answer. Named entities are identified in the data, and are used to fill the blanks for the cloze task. The benchmark data consist of over 120,000 examples, most of which are claimed to require commonsense reasoning. ReCoRD can be downloaded at

2.1.5 Psychological Reasoning

An especially significant domain of knowledge in plausible inference tasks is the area of human sociopsychology, as inference of emotions and intentions through behavior is a fundamental capability of humans (?). Several benchmarks touch on social psychology in some examples, e.g., the marriage proposal example in ROCStories (?) from Figure 6, but some are entirely focused here. Examples from each benchmark are listed in Figure 7, and they require sociopsychological commonsense knowledge such as plausible reactions to being punched or yelled at. In the following paragraphs, we introduce these benchmarks in detail.

  1. 1.

    Triangle-COPA (?)
    A circle and a triangle are in the house and are arguing. The circle punches the triangle. The triangle runs out of the house. Why does the triangle leave the house?

    1. (a)

      The triangle leaves the house because it wants the circle to come fight it outside.

    2. (b)

      The triangle leaves the house because it is afraid of being further assaulted by the circle.

  2. 2.

    SC (?). Footnote 4: Example extracted from Story Commonsense development data available at
    Jervis has been single for a long time.
    He wants to have a girlfriend.
    One day he meets a nice girl at the grocery store.

    Motivation: to be loved, companionship
    Emotions: shocked, excited, hope, shy, fine, wanted

    Maslow: love
    Reiss: contact, romance
    Plutchik: joy, surprise, anticipation, trust, fear

  3. 3.

    Event2Mind (?)
    PersonX starts to yell at PersonY

    PersonX’s intent: to express anger, to vent their frustration, to get PersonY’s full attention
    PersonX’s reaction: mad, frustrated, annoyed
    Other’s reactions: shocked, humiliated, mad at PersonX

Figure 7: Examples from sociopsychological inference benchmarks. Answers in bold.

Triangle-COPA by ? (?) is a variation of COPA (?) based on a popular social psychology experiment. It contains 100 examples in the format of COPA, and accompanying videos. Questions focus specifically on emotions, intentions, and other aspects of social psychology. The data also includes logical forms of the questions and alternatives, as the paper focuses on logical formalisms for psychological commonsense. Triangle-COPA can be downloaded at

Story Commonsense.

As mentioned earlier, the stories from ROCStories by ? (?) are rich in sociological and psychological instances of commonsense. Motivated by classical theories of motivation and emotions from psychology, ? (?) created the Story Commonsense (SC) benchmark containing about 160,000 annotations of the motivations and emotions of characters in ROCStories to enable more concrete reasoning in this area. In addition to the tasks of generating motivational and emotional annotations, the dataset introduces three classification tasks: one for inferring the basic human needs theorized by ? (?), one for inferring the human motives theorized by ? (?), and one for inferring the human emotions theorized by ? (?). SC can be downloaded at


In addition to motivations and emotions, systems may need to infer intentions and reactions surrounding events. To support this, ? (?) introduce Event2Mind, a benchmark dataset of about 57,000 annotations of intentions and reactions for about 25,000 unique events extracted from other corpora, including ROCStories (?). Each event involves one or two participants, and presents three tasks of predicting the primary participant’s intentions and reactions, and predicting the reactions of others. Event2Mind can be downloaded at

2.1.6 Multiple Tasks

Some benchmarks consist of several focused language processing or reasoning tasks so that reading comprehension skills can be learned one by one in a consistent format. While bAbI includes some tasks for coreference resolution, it also covers other prerequisite skills in other tasks, e.g., relation extraction (?). Similarly, the recognition of sentiment, paraphrase, grammaticality, and even puns is the focus of different tasks within the DNC benchmark (?). Table 1 provides a comparison of all such multi-task commonsense benchmarks by the types of language processing tasks they provide. In the following paragraphs, we introduce these benchmarks in detail.


The bAbI benchmark from ? (?) consists of 20 prerequisite tasks, each with 1,000 examples for training and 1,000 for testing. Each task presents systems with a passage, then asks a reading comprehension question, but each task focuses on a different type of reasoning or language processing task, allowing systems to learn basic skills one at a time. Tasks are as follows:

  1. 1.

    Single supporting fact

  2. 2.

    Two supporting facts

  3. 3.

    Three supporting facts

  4. 4.

    Two argument relations

  5. 5.

    Three argument relations

  6. 6.

    Yes/no questions

  7. 7.

    Counting

  8. 8.

    Lists/sets

  9. 9.

    Simple negation

  10. 10.

    Indefinite knowledge

  11. 11.

    Basic coreference

  12. 12.

    Conjunction

  13. 13.

    Compound coreference

  14. 14.

    Time reasoning

  15. 15.

    Basic deduction

  16. 16.

    Basic induction

  17. 17.

    Positional reasoning

  18. 18.

    Size reasoning

  19. 19.

    Path finding

  20. 20.

    Agent’s motivations

In addition to providing focused language processing tasks as previously discussed, bAbI also provides focused commonsense reasoning tasks requiring particular kinds of logical and physical commonsense knowledge, such as its tasks for deduction and induction, and time, positional, and size reasoning (?). Selected examples of these tasks are given in Figure 8, and demand commonsense knowledge such as the fact that members of an animal species are typically all the same color, and the relationship between objects’ size and their ability to contain each other. bAbI can be downloaded at

Reasoning Type    bAbI    IIE    GLUE    DNC
Semantic Role Labeling
Relation Extraction
Event Factuality
Named Entity Recognition
Coreference Resolution
Lexicosyntactic Inference
Sentiment Analysis
Figurative Language
Textual Entailment
Table 1: Comparison of language processing tasks present in the bAbI (?), IIE (?), GLUE (?), and DNC (?) benchmarks. Recent multi-task benchmarks focus on the recognition of an increasing variety of linguistic and semantic phenomena.
  1. 1.

    Task 15: Basic Deduction
    Sheep are afraid of wolves.
    Cats are afraid of dogs.
    Mice are afraid of cats.
    Gertrude is a sheep.

    What is Gertrude afraid of?

  2. 2.

    Task 16: Basic Induction
    Lily is a swan.
    Lily is white.
    Bernhard is green.
    Greg is a swan.

    What color is Greg?

  3. 3.

    Task 17: Positional Reasoning
    The triangle is to the right of the blue square.
    The red square is on top of the blue square.
    The red sphere is to the right of the blue square.

    Is the red square to the left of the triangle?

  4. 4.

    Task 18: Size Reasoning
    The football fits in the suitcase.
    The suitcase fits in the cupboard.
    The box is smaller than the football.

    Will the box fit in the suitcase?

Figure 8: Examples of selected logically- and physically-grounded commonsense reasoning tasks from bAbI by ? (?). Answers in bold.
Inference is Everything.

Inference is Everything (IIE) by ? (?) follows bAbI (?) in creating a suite of tasks, where each task is deliberately geared toward a different language processing task: semantic proto-role labeling, paraphrase, and pronoun resolution. Each task is in classic RTE Challenge format (?), i.e., given context and hypothesis texts, one must determine whether the context entails the hypothesis. Between the tasks, IIE includes about 300,000 examples, all of which are recast from previously existing datasets. IIE can be downloaded with another multi-task suite at


The General Language Understanding Evaluation (GLUE) dataset from ? (?) consists of 9 tasks, ranging from focused to more comprehensive, in various forms, including single-sentence binary classification and 2- or 3-way entailment comparable to the dual tasks in RTE-4 and RTE-5 (?, ?). Most of these tasks either directly relate to commonsense, or may be useful in creating systems which utilize traditional linguistic processes like paraphrase in performing commonsense reasoning. The GLUE tasks are recast or included directly from other benchmark data and corpora:

  • Corpus of Linguistic Acceptability (CoLA) from ? (?)

  • Stanford Sentiment Treebank (SST-2) from ? (?)

  • Microsoft Research Paraphrase Corpus (MRPC) from ? (?) ? (?)

  • Quora Question Pairs (QQP) from ? (?)

  • Semantic Textual Similarity Benchmark (STS-B) from ? (?)

  • Multi-Genre Natural Language Inference (MNLI) from ? (?)

  • Question Natural Language Inference (QNLI) recast from SQuAD 1.1 ? (?)

  • Recognizing Textual Entailment (RTE), consisting of examples from RTE-1 (?), RTE-2 (?), RTE-3 (?), and RTE-5 (?)

  • Winograd Natural Language Inference (WNLI) recast from privately shared Winograd schemas by the creators of the Winograd Schema Challenge ? (?)

GLUE includes a small analysis set for diagnostic purposes, which has manual annotations of the fine-grained categories that pairs of sentences fall into (e.g., commonsense), as well as labeled reversed versions of examples. Overall, GLUE has over 1 million examples, which can be downloaded at


The Diverse Natural Language Inference Collection (DNC) by ? (?) consists of 9 textual entailment tasks requiring 7 different types of reasoning. Like IIE by ? (?), data for each task follows the form of the original RTE Challenge (?). Some of the tasks within DNC cover fundamental reasoning skills which are required for any reasoning system, while others cover more challenging reasoning skills which require commonsense. Each task is recast from a previously existing dataset:

  1. 1.

    Event Factuality, recast from UW (?), MEANTIME (?), and (?)

  2. 2.

    Named Entity Recognition, recast from the Groningen Meaning Bank (?) and the CoNLL-2003 shared task (?)

  3. 3.

    Gendered Anaphora Resolution, recast from the Winogender dataset (?)

  4. 4.

    Lexicosyntactic Inference, recast from MegaVeridicality (?), VerbNet (?), and VerbCorner (?)

  5. 5.

    Figurative Language, recast from puns by ? (?) and ? (?)

  6. 6.

    Relation Extraction, partially from FACC1 (?)

  7. 7.

    Subjectivity, recast from ? (?)

The DNC benchmark consists of about 570,000 examples total, and can be downloaded at

2.2 Criteria and Considerations for Creating Benchmarks

The goal of benchmarks is to support technology development and provide a platform to measure research progress in commonsense reasoning. Whether this goal can be achieved depends on the nature of the benchmark. This section identifies the successes and lessons learned from the existing benchmarks and summarizes key considerations and criteria that should guide the creation of the benchmarks, particularly in the areas of task format, evaluation schemes, balance of data, and data collection methods.

2.2.1 Task Format

In creating benchmarks, determining the formulation of the problem is an important step. Among existing benchmarks, there exist a few common task formats, and while some task formats are interchangeable, others are only suited for particular tasks. We provide a review of these formats, indicating the types of tasks they are suitable for.

Classification tasks.

Most benchmark tasks are classification problems, where each response is a single choice from a finite number of options. These include textual entailment tasks, which most commonly require a binary or ternary decision about a pair of sentences, cloze tasks, which require a multiple-choice decision to fill in a blank, and traditional multiple-choice question answering tasks.

Textual entailment tasks. A highly popular format was originally introduced by the RTE Challenges, where given a pair of texts, i.e., a context and hypothesis, one must determine whether the context entails the hypothesis (?). In the fourth and fifth RTE Challenges, this format was extended to a three-way decision problem where the hypothesis may contradict the context (?, ?). The JOCI benchmark further extends the problem to a five-way decision task, where the hypothesis text may range from impossible to very likely given the context (?).

While this format is typically used for textual entailment problems like the RTE Challenges, it can be used for nearly any type of inference problem. Some multi-task benchmarks have adopted the format for several different reasoning tasks. For example, the Inference is Everything (?), GLUE (?), and DNC (?) benchmarks use the classic RTE format for most of their tasks by automatically recasting previously existing datasets into it. Many of these problems deal either with reasoning processes more focused than the RTE Challenges, or more advanced reasoning than the RTE Challenges demand, showing the flexibility of the format. Some such tasks include coreference resolution, recognition of puns, and question answering. Examples from these recast tasks are listed in Figure 9.

  1. 1.

    GLUE, Question Answering NLI
    Context: Who was the main performer at this year’s halftime show?

    Hypothesis: The Super Bowl 50 halftime show was headlined by the British rock group Coldplay with special guest performers Beyoncé and Bruno Mars, who headlined the Super Bowl XLVII and Super Bowl XLVIII halftime shows, respectively.

    Label: entailed

  2. 2.

    IIE, Definite Pronoun Resolution
    Context: The bird ate the pie and it died.

    Hypothesis: The bird ate the pie and the bird died.

    Label: entailed

  3. 3.

    DNC, Figurative Language
    Context: Carter heard that a gardener who moved back to his home town rediscovered his roots.

    Hypothesis: Carter heard a pun.

    Label: entailed

Figure 9: Examples from tasks within the General Language Understanding Evaluation (GLUE) benchmark by ? (?), Inference is Everything (IIE) by ? (?), and the Diverse NLI Collection (DNC) by ? (?). Each task is recast from preexisting data into classic RTE format.
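
The recasting idea described above can be sketched in a few lines. The following is a minimal illustration only, not the procedure used by the IIE, GLUE, or DNC authors: the function name `recast_pronoun_example`, the label strings, and the naive token substitution are all assumptions made for this example.

```python
from typing import List, Tuple


def recast_pronoun_example(sentence: str, pronoun: str,
                           candidates: List[str],
                           referent: str) -> List[Tuple[str, str, str]]:
    """Turn one pronoun-resolution example into RTE-style triples.

    Hypothetical helper: each candidate antecedent is substituted for the
    first standalone occurrence of the pronoun to form a hypothesis.
    """
    examples = []
    for candidate in candidates:
        hypothesis_tokens = []
        replaced = False
        for token in sentence.split():
            if not replaced and token == pronoun:
                hypothesis_tokens.append(candidate)
                replaced = True
            else:
                hypothesis_tokens.append(token)
        hypothesis = " ".join(hypothesis_tokens)
        # The context entails the hypothesis only for the true referent.
        label = "entailed" if candidate == referent else "not-entailed"
        examples.append((sentence, hypothesis, label))
    return examples


pairs = recast_pronoun_example(
    "The bird ate the pie and it died.", "it",
    ["the bird", "the pie"], "the bird")
for context, hypothesis, label in pairs:
    print(context, "|", hypothesis, "|", label)
```

Each source example yields one context and hypothesis pair per candidate, so a single pronoun-resolution item becomes several binary entailment items.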

Cloze tasks. Another popular format is the cloze task, originally conceived by ? (?). Such a task typically involves the deletion of one or more words in a text, essentially requiring one to fill in the blank from a set of choices, a format suitable for language modeling problems. This format has been used in several commonsense benchmarks, including CBT (?), the Story Cloze Test for the ROCStories benchmark (?), CLOTH (?), SWAG (?), and ReCoRD (?). These benchmarks provide anywhere from two to ten options to fill in the blank, and range from requiring the prediction of a single word to parts of sentences and entire sentences. Examples of cloze tasks are listed in Figure 10.

  1. 1.

    CBT (?)

    1. (a)

      Mr. Cropper was opposed to our hiring you .

    2. (b)

      Not , of course , that he had any personal objection to you , but he is set against female teachers , and when a Cropper is set there is nothing on earth can change him .

    3. (c)

      He says female teachers ca n’t keep order .

    4. (d)

      He ’s started in with a spite at you on general principles , and the boys know it .

    5. (e)

      They know he ’ll back them up in secret , no matter what they do , just to prove his opinions .

    6. (f)

      Cropper is sly and slippery , and it is hard to corner him . ”

    7. (g)

      “ Are the boys big ? ”

    8. (h)

      queried Esther anxiously .

    9. (i)

      “ Yes .

    10. (j)

      Thirteen and fourteen and big for their age .

    11. (k)

      You ca n’t whip ’em – that is the trouble .

    12. (l)

      A man might , but they ’d twist you around their fingers .

    13. (m)

      You ’ll have your hands full , I ’m afraid .

    14. (n)

      But maybe they ’ll behave all right after all . ”

    15. (o)

      Mr. Baxter privately had no hope that they would , but Esther hoped for the best .

    16. (p)

      She could not believe that Mr. Cropper would carry his prejudices into a personal application .

    17. (q)

      This conviction was strengthened when he overtook her walking from school the next day and drove her home .

    18. (r)

      He was a big , handsome man with a very suave , polite manner .

    19. (s)

      He asked interestedly about her school and her work , hoped she was getting on well , and said he had two young rascals of his own to send soon .

    20. (t)

      Esther felt relieved .

    21. (u)

      She thought that Mr. _____ had exaggerated matters a little .

      Blank: Baxter, Cropper, Esther, course, fingers, manner, objection, opinions, right, spite

  2. 2.

    ROCStories (?)
    Karen was assigned a roommate her first year of college. Her roommate asked her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely exhilarating.


    1. (a)

      Karen became good friends with her roommate.

    2. (b)

      Karen hated her roommate.

  3. 3.

    CLOTH (?)
    She pushed the door open and found nobody there. "I am the _____ to arrive." She thought and came to her desk.

    1. (a)


    2. (b)


    3. (c)


    4. (d)


  4. 4.

    SWAG (?)
    On stage, a woman takes a seat at the piano. She

    1. (a)

      sits on a bench as her sister plays with the doll.

    2. (b)

      smiles with someone as the music plays.

    3. (c)

      is in the crowd, watching the dancers.

    4. (d)

      nervously sets her fingers on the keys.

  5. 5.

    ReCoRD (?)
    … Daniela Hantuchova knocks Venus Williams out of Eastbourne 6-2 5-7 6-2 …

    Query: Hantuchova breezed through the first set in just under 40 minutes after breaking Williams’ serve twice to take it 6-2 and led the second 4-2 before _____ hit her stride.
    Venus Williams

Figure 10: Examples from cloze tasks. Answers in bold.
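
As a rough illustration of the cloze format, the sketch below deletes one word from a sentence and packages it with distractors as a multiple-choice item. The helper `make_cloze`, the blank marker, and the toy sentence and distractors are assumptions for this example; the surveyed benchmarks select blanks and distractors with their own, more careful procedures.

```python
from typing import List, Tuple


def make_cloze(sentence: str, answer: str, distractors: List[str],
               blank: str = "_____") -> Tuple[str, List[str], int]:
    """Blank out the first occurrence of `answer` and build answer options.

    Returns the question text, the sorted options, and the index of the
    correct option. Hypothetical helper for illustration only.
    """
    tokens = sentence.split()
    if answer not in tokens:
        raise ValueError("answer must appear as a token in the sentence")
    tokens[tokens.index(answer)] = blank
    # Sorting removes any positional cue about which option is correct.
    options = sorted([answer] + distractors)
    return " ".join(tokens), options, options.index(answer)


question, options, answer_index = make_cloze(
    "The suitcase fits in the cupboard", "cupboard", ["football", "box"])
print(question)
print(options, "answer:", options[answer_index])
```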

Traditional multiple-choice tasks. If not in entailment or cloze form, benchmark classification tasks tend to use traditional multiple-choice questions. Benchmarks which use this format include COPA (?), Triangle-COPA (?), the Winograd Schema Challenge (?), ARC (?), MCScript (?), and OpenBookQA (?). Two-way and four-way decision questions are the most common among these benchmarks.

Open-ended tasks.

On the other hand, some benchmarks require open-ended responses rather than providing a small list of alternatives to choose from. Answers may be restricted to spans of a given text, as in SQuAD (?, ?) or CoQA (?). They may be less restricted, drawn as a subset of a large number of category labels, as in the Maslow, Reiss, and Plutchik tasks in Story Commonsense (?). Finally, they may be purely open-ended, as in the motivation and emotion tasks in Story Commonsense, Event2Mind (?), or bAbI (?). Examples of these open-ended formats are listed in Figure 11.

  1. 1.

    SQuAD 2.0 (?) (footnote 5)
    In February 2016, over a hundred thousand people signed a petition in just twenty-four hours, calling for a boycott of Sony Music and all other Sony-affiliated businesses after rape allegations against music producer Dr. Luke were made by musical artist Kesha. Kesha asked a New York City Supreme Court to free her from her contract with Sony Music but the court denied the request, prompting widespread public and media response.

    How many people signed a petition to boycott Sony Music in 2016?
    over a hundred thousand

  2. 2.

    SC (?). Footnote 6: Example extracted from SQuAD training data available at
    Valerie was getting ready for a formal dance. She had been preparing for hours. As she was ready to leave, her acrylic nail broke. She snapped off all of her faux nails.

    Maslow: esteem, stability
    Reiss: status, approval, order
    Plutchik: surprise, sadness, disgust, anger

  3. 3.

    bAbI (?)
    The kitchen is north of the hallway.
    The bathroom is west of the bedroom.
    The den is east of the hallway.
    The office is south of the bedroom.

    How do you go from den to kitchen?
    west, north

Figure 11: Examples from open-ended response tasks. Answers in bold.
Footnote 7: Example extracted from Story Commonsense test data available at

2.2.2 Evaluation Schemes

The Turing Test has long been criticized by AI researchers as it does not truly evaluate machine intelligence. There is a critical need for new intelligence benchmarks to support incremental development and evaluation of AI techniques, as described by ? (?). These benchmarks should not merely provide a pass or fail grade; rather, they should provide feedback on a continuous scale which enables both incremental development and comparison of approaches. One key consideration for these benchmarks is informative evaluation metrics that are objective and easy to calculate. These metrics can be used to compare different approaches and compare machine performance against human performance.

Evaluation metrics.

Choice of evaluation metrics is highly dependent on the type of task, and thus so is the difficulty of calculating them. Multiple-choice tasks often use exact-match accuracy if correct answers or class labels are evenly distributed through benchmark data. If this is not the case, common practice is to additionally present F-measure as an evaluation metric (?). Precision and recall may also be presented; however, the F-measure (which considers both) is much more common in the recently surveyed benchmarks. Multiple-choice and classification task formats such as RTE, cloze, and traditional multiple-choice can all use these metrics.
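
To make these metrics concrete, the following sketch computes exact-match accuracy together with precision, recall, and F-measure (here the balanced F1) for one class of interest. The function names and the toy labels are assumptions for illustration and are not taken from any benchmark's official scorer.

```python
from typing import Sequence, Tuple


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision_recall_f1(gold: Sequence[str], pred: Sequence[str],
                        positive: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with `positive` as the class of interest."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels invented for this example.
gold = ["entailed", "entailed", "not-entailed", "entailed"]
pred = ["entailed", "not-entailed", "not-entailed", "entailed"]
print(accuracy(gold, pred))
print(precision_recall_f1(gold, pred, positive="entailed"))
```

On skewed label distributions, accuracy can look high for a trivial predictor while F1 on the minority class stays low, which is why both are often reported.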

Open-ended tasks are by nature more difficult to evaluate, but they can still be objective and informative. Open-ended tasks like SQuAD (?) or CoQA (?) where answers can only be spans of a provided text can be evaluated similarly to multiple-choice tasks. Exact-match accuracy and F-measure are used as evaluation metrics on both of these tasks, where the collection of tokens in the predicted and true spans (excluding punctuation and articles) are compared. Where answers are a subset of a large group of category labels, e.g., the Maslow, Reiss, and Plutchik tasks in Story Commonsense, evaluation is similar, but for these tasks particularly, precision and recall are additionally included (?).
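
The span-level comparison described above can be sketched as follows. This follows the commonly used convention of lowercasing and stripping punctuation and English articles before comparing token bags; it is an approximation written for this survey, not a copy of the official SQuAD or CoQA evaluation scripts, and the function names are assumptions.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """F1 over the bags of tokens in the predicted and gold spans."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold_span = "over a hundred thousand"
print(exact_match("Over a hundred thousand.", gold_span))
print(token_f1("a hundred thousand people", gold_span))
```

Token-level F1 gives partial credit to a prediction that overlaps the gold span without matching it exactly, which exact match alone cannot do.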

Where multiple purely open-ended responses are given, as in Event2Mind (?), evaluation is more difficult. Event2Mind particularly uses the average cross-entropy and the "recall @ 10," i.e., the percentage of times human-produced ground truth labels fall within the top 10 predictions from a system. In bAbI, where a purely open-ended response is compared to a single ground truth answer, exact-match accuracy is used (?). Both Event2Mind and bAbI are able to use such exact evaluation measures because responses are short. In bAbI particularly, correct responses are limited to one word or lists of words. Such restrictions are essential for such simple and accurate evaluation. For longer generated responses, evaluation metrics like BLEU for machine translation (?), or the modified ROUGE for text summarization (?), are useful.
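
A minimal sketch of the "recall @ 10" measure, following the description above (the share of human-produced labels that appear among a system's top k predictions), is given below. The function name, the data layout, and the toy predictions are assumptions; the official Event2Mind scorer may differ in details such as text normalization.

```python
from typing import List, Sequence


def recall_at_k(ranked_predictions: Sequence[List[str]],
                gold_labels: Sequence[List[str]], k: int = 10) -> float:
    """Share of gold labels that fall within the top-k ranked predictions."""
    hits = 0
    total = 0
    for predictions, labels in zip(ranked_predictions, gold_labels):
        top_k = set(predictions[:k])
        hits += sum(label in top_k for label in labels)
        total += len(labels)
    return hits / total if total else 0.0


# Toy data invented for this example; k is reduced to 2 to keep it small.
ranked = [["mad", "angry", "sad"], ["happy", "glad"]]
gold = [["mad", "frustrated"], ["glad"]]
print(recall_at_k(ranked, gold, k=2))
```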

Comparison of approaches.

Benchmarks provide common datasets and experimental setups for researchers to compare different approaches. When a new benchmark is first released, it often reports results from simple baseline approaches. Ideal baselines should lead to relatively low performance, thus leaving room for improvement from more advanced approaches. For multiple-choice problems, baseline approaches are often calculated by random choice, choosing the class appearing the most in the test data, or choosing the alternative with the highest overlap in n-grams with the question or provided text (?). Examples of these baseline approaches can be found in the Story Cloze Test baselines, which include most of these approaches and more (?). For open-ended problems, shallow lexical approaches (e.g., using language models) may again be used, as in the bAbI benchmark (?). When competitive approaches are developed for existing benchmarks, they are often used as baselines in new benchmarks, thus boosting baseline performance over time and encouraging the development of new or improved models to solve problems.
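
The n-gram overlap baseline mentioned above can be sketched as follows. The tokenizer, the choice of unigrams plus bigrams, the tie-breaking rule, and the toy question are all assumptions for illustration; published baselines vary in these details.

```python
import re
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_baseline(context: str, alternatives: List[str],
                     max_n: int = 2) -> int:
    """Index of the alternative sharing the most n-grams with the context.

    Ties are resolved in favor of the earliest alternative.
    """
    context_tokens = re.findall(r"\w+", context.lower())

    def score(alternative: str) -> int:
        alt_tokens = re.findall(r"\w+", alternative.lower())
        return sum(len(ngrams(context_tokens, n) & ngrams(alt_tokens, n))
                   for n in range(1, max_n + 1))

    scores = [score(alternative) for alternative in alternatives]
    return scores.index(max(scores))


# Toy example invented for this sketch.
choice = overlap_baseline(
    "The man broke his toe.",
    ["He got a hole in his sock.", "He dropped a hammer on his toe."])
print(choice)
```

Such a baseline performs no reasoning, so when it scores well above chance, the benchmark's alternatives likely leak lexical cues.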

Human performance measurement.

To evaluate the progress of machine intelligence, human performance on benchmark tasks is often measured to provide a reference point, e.g., through crowdsourcing techniques. ROCStories (?), SQuAD (?), and CoQA (?) all provide metrics for human performance. The goal of computational models is to come close to or exceed human performance.

2.2.3 Data Biases

When creating benchmarks, one challenge is the bias unintentionally introduced into the benchmark data. For example, in the first release of the Visual Question Answering (VQA) benchmark (?), researchers found that machine learning models were learning several statistical biases in the data, and could answer up to 48% of questions in the validation set without seeing the image (?). This artificially high system performance is problematic, as it cannot be credited to the underlying technology. Here we summarize several key dimensions of biases encountered in previous commonsense research. Some of these (e.g., class label distributions) are easier to avoid, while others (e.g., hidden correlation biases) are more difficult to address.

Label distribution bias.

Class label distribution bias is the easiest to avoid. In multiple-choice problems, correct answers or class labels should be entirely randomized so that each possible choice appears in benchmark data in a uniform distribution. This way, a majority-class baseline will score as low as possible on the task. While binary-choice tasks should have a 50% majority class baseline, the MegaVeridicality task within DNC (?) has a 67% majority-class baseline due to unevenly distributed class labels, leaving significantly less room for incremental improvement than tasks with lower-performing baselines.
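
A quick check for this kind of bias is to compute the majority-class baseline directly from the label distribution, as sketched below. The function name and the toy labels are assumptions for illustration; the 2:1 split is chosen only to show how an uneven binary distribution yields a baseline near 67%.

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_class_baseline(labels: Sequence[str]) -> Tuple[str, float]:
    """Most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy label set invented for this example.
labels = ["entailed", "entailed", "not-entailed"]
majority_label, baseline_accuracy = majority_class_baseline(labels)
print(majority_label, round(baseline_accuracy, 2))
```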

Question type bias.

For benchmarks involving question answering tasks, previous work has made an effort to balance the types of questions, especially if questions are generated by crowdsourcing. This ensures that a broad domain of knowledge and reasoning is required to solve the task. A fairly simple method to keep a balance of question types is to calculate the distribution of the first words of each question, as the creators of CoQA (?) and CommonsenseQA (?) did. One could also manually label a random sample of questions with categories relating to types of knowledge or reasoning required, or have crowd workers perform this task if an expert is not essential for this process. Examples of this are shown by the creators of SQuAD 2.0 (?) and ARC (?). To entirely avoid question type biases, implementing a standard set of questions for all provided texts may be beneficial. ProPara does this for all participants in its procedural paragraphs (?), limiting questions about each entity to whether it is created, destroyed, or moved during the paragraph, and when and where this happens. ? (?) suggest that biases can further be avoided in VQA benchmarks by forcing questions to require a particular skill (e.g., telling time) to be answered, and this rule of thumb can be applicable for textual benchmarks as well.
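
The first-word analysis described above amounts to a simple frequency count. The sketch below is one way to do it; the function name is an assumption, and the sample questions are taken from the bAbI examples in Figure 8 purely to have something to count.

```python
from collections import Counter
from typing import Dict, Sequence


def first_word_distribution(questions: Sequence[str]) -> Dict[str, float]:
    """Relative frequency of the (lowercased) first word of each question."""
    first_words = [q.split()[0].lower() for q in questions if q.split()]
    counts = Counter(first_words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.most_common()}


questions = [
    "What is Gertrude afraid of?",
    "What color is Greg?",
    "Is the red square to the left of the triangle?",
    "Will the box fit in the suitcase?",
]
print(first_word_distribution(questions))
```

A distribution dominated by one or two first words suggests that the question types, and likely the reasoning they require, are not well balanced.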

Superficial correlation bias.

Perhaps the biases most difficult to discern and avoid are those caused by accidental correlations between features of answers and questions. One example of this is gender bias, which commonsense reasoning systems are particularly vulnerable to when training on biased data. ? (?) highlight this problem in coreference resolution, showing that systems trained on gender-biased data perform worse in gender pronoun disambiguation tasks. For example, consider the problem from their Winogender dataset in Figure 3: "The paramedic performed CPR on the passenger even though she knew it was too late." In determining who she is, systems trained on gender-biased training data may be more likely, for example, to incorrectly choose the passenger rather than the paramedic due to male gender pronouns appearing more commonly in the context of this occupation than female gender pronouns. To avoid this, gender pronouns should appear equally frequently among other words, especially those related to occupations and activities. Similar gender biases are identified in Event2Mind data, which are derived from movie scripts (?).

When authoring natural language data (e.g., generating questions or hypotheses), some human stylistic artifacts such as predictable sentence structure, presence of certain linguistic phenomena, and vocabulary use can also cause these superficial correlation biases. This is particularly the case if data is authored by crowd workers. In the Story Cloze Test (?), systems are presented with a plausible and an implausible ending to the story, and must choose which ending is plausible. However, previous work by ? (?) has shown that the Story Cloze Test can be solved with up to 75.2% accuracy by only looking at the two possible endings. They do this by exploiting human writing style biases in the possible endings rather than performing actual commonsense reasoning. For example, they find that negative language is used more commonly in the wrong ending (e.g., "hates"), and the correct ending is more likely to use enthusiastic language (e.g., "!"). An example of a biased negative ending is seen in Figure 10. ? (?) have begun work to update the benchmark data and remove these biases.

While generating the Story Cloze Test data was not a fast or simple task for crowd workers, ? (?) suggest that such biases can come from crowd workers’ adoption of predictable annotation strategies and heuristics to quickly generate data. These strategies have been revealed for several textual entailment benchmarks which consist of pairs of short sentences. For example, on the entailment task in SemEval 2014 (?) as part of the SICK benchmark (?), ? (?) found that the presence of negation in an example was strongly associated with the appearance of the contradiction class label. Their trained classifier using this feature alone was able to achieve 61% accuracy. Later, ? (?) and ? (?) found the presence of particular words in the hypothesis sentence can bias the entailment prediction in several entailment benchmarks. For example, "nobody" in contradictory examples from SNLI (?) was found to be an indicator of contradiction, while generic words like "animal" and "instrument", as well as gender-neutral pronouns, were found to be indicators of entailment. ? further find that a high sentence length is an indicator of neutral entailment, and suggest that crowd workers often remove words from the context sentences to create entailed hypothesis sentences. Using biases like these, a baseline approach by ? which only used the hypothesis sentence from entailment benchmarks was able to outperform a majority-class baseline in SNLI, JOCI (?), SciTail (?), two tasks within Inference is Everything (?), and the MultiNLI task within GLUE (?, ?).

These results demonstrate a serious need for greater attention to such bias in commonsense benchmarks. To recognize biases, a simple technique is to calculate the mutual information between words and classes within benchmark examples. This was performed by researchers in discovering stylistic biases in entailment benchmarks (?), but this analysis should be performed on any new benchmark data when it is created. To avoid the biases, more advanced techniques may be required. For example, in creating the SWAG benchmark (?), a novel adversarial filtering process was introduced to ensure writing styles are consistent among ending choices, and the correct answer cannot be identified by exploitative stylistic classifiers. A continuous effort in finding techniques to avoid these kinds of biases will be important for developing future benchmarks.
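
The word-class association check described above can be sketched in a few lines. The example below is a minimal illustration rather than the procedure used in the cited work: it computes pointwise mutual information (PMI) between each word and each class label from per-example word presence, and the function name `word_class_pmi` and the toy hypotheses are hypothetical.

```python
import math
from collections import Counter


def word_class_pmi(examples):
    """Return {(word, label): PMI}, counting each word once per example.

    `examples` is a list of (text, label) pairs. Tokenization is a plain
    lowercase whitespace split, which is an assumption made for brevity.
    """
    n = len(examples)
    word_counts = Counter()
    label_counts = Counter()
    joint_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        for word in set(text.lower().split()):
            word_counts[word] += 1
            joint_counts[(word, label)] += 1

    pmi = {}
    for (word, label), joint in joint_counts.items():
        p_joint = joint / n
        p_word = word_counts[word] / n
        p_label = label_counts[label] / n
        pmi[(word, label)] = math.log(p_joint / (p_word * p_label))
    return pmi


# Hypothetical toy hypotheses, not drawn from any real benchmark.
toy = [
    ("nobody is playing outside", "contradiction"),
    ("nobody is sleeping", "contradiction"),
    ("an animal is outside", "entailment"),
    ("a person is playing an instrument", "entailment"),
]
scores = word_class_pmi(toy)

# Words with the highest PMI for a class are candidate stylistic cues.
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
```

A word that appears evenly across classes receives a score near zero, while a word concentrated in one class receives a positive score and is worth inspecting as a possible annotation artifact.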

2.2.4 Collection Methods

Methods to collect benchmark data ideally should be cost-efficient and should result in high-quality and unbiased data. Both manual and automatic approaches have been applied. Manual curation of data can be done by experts/researchers or through crowd-workers, which comes with its own set of considerations. Automatic approaches often involve automatically generating data by applying language models or automatically extracting or mining data from existing resources. As shown in Table 2, a benchmark is often created through a combination of these approaches. For the rest of this section, we summarize pros and cons for these different approaches.

Manual versus automatic generation.

Until recently, many of the existing benchmarks were created manually by groups of experts. This may involve tedious processes like manually collecting data from other corpora or the Internet, e.g., the first RTE Challenge (?), or authoring most data from scratch, e.g., the Winograd Schema Challenge (?, ?, ?, ?). This approach ensures data is high quality and thus typically requires little validation; however, it is not scalable. These datasets are often small compared to those created with other approaches.

Recent advances in NLP make it possible to automatically generate textual data (e.g., natural language statements, questions, etc.) for benchmark tasks. Although this approach is scalable and efficient, the quality of data varies, and often depends directly on the language model used. For example, in bAbI (?), as agents interacted with objects in a virtual world and with each other, examples were automatically generated. This method ensures that produced data are sensible to the constraints of the physical world. However, as the questions and answers are written with simple structure, the data is easily understood by machines. Most of the dataset is solved with 100% accuracy just by baseline systems. Consequently, the bAbI tasks are often considered as toy tasks. Further, models trained on bAbI currently cannot generalize well to real-world, naturally-generated data (?). A more sophisticated rule-based method for probabilistically generating text data which encourages more diverse language without any added biases is presented by ? (?). Though automatic natural language generation methods are improving, it is likely that such approaches will still require some manual validation.

Automatic generation versus text mining.

As millions of natural language texts are publicly available on the Internet and in existing datasets, it is possible to build commonsense benchmarks from these texts by automatically mining texts and extracting sentences. This process is most successful when the information source is created by experts and highly accurate. For example, in the CLOTH (?) benchmark, data instances were mined from fill-in-the-blank English tests created by human teachers. While other automatically generated cloze tasks like CBT (?) choose missing words mostly randomly, CLOTH’s cloze task is more challenging as the missing word in each example was chosen by an expert. For many other benchmarks that are built by mining less reliable or consistent sources, there’s often a need for automatic or human validation or filtering, e.g., in creating SWAG (?), which was mined in part from other corpora, such as the Large Scale Movie Description Challenge (?) and the captions in ActivityNet (?).

Crowdsourcing considerations.

Acquiring language data directly from crowd workers has become more feasible in recent years, due to the growth of crowdsourcing platforms, such as Amazon Mechanical Turk. Crowdsourcing has enabled researchers to create larger datasets than ever before; however, it comes with a set of considerations relating to task complexity, worker qualification, data validation, and cost optimization.

Task complexity. When creating a crowdsourcing task, it is important to consider the difficulty level of what crowd workers will be expected to do. When given overly complicated instructions, non-expert crowd workers may find it difficult to understand the instructions and keep them in mind while working on the task. Easy crowdsourcing tasks typically involve quick pass/fail validation or relabeling of data, e.g., in validating SNLI (?). Difficult crowdsourcing tasks usually require crowd workers to write large texts, e.g., in creating ROCStories (?), where workers wrote five-sentence stories following a fairly elaborate set of restrictions on story content to ensure stories were high-quality, focused, and well-organized. Such restrictions must be explained briefly and clearly, and it may take several pilot studies to ensure workers understand and follow the instructions correctly (?).

Worker qualification. Regardless of task difficulty, efforts should be made to avoid workers submitting invalid data, whether they are trolls or unable to follow directions. This may be done through some sort of qualification task, perhaps requiring a prospective worker to recognize examples of acceptable submissions (?), or testing a prospective worker’s grammar (?). It may also be worthwhile to identify excellent workers, and reward them and/or recruit them for more work (?). In our own experiences with crowdsourcing, we have found that if a worker produces one invalid submission, all of his/her submissions will likely be invalid, and thus the worker should be rejected and potentially banned from the task. On the other hand, if a worker produces an excellent submission, all of his/her submissions will likely be excellent.

Data validation. Even though crowdsourced data is produced by non-experts, data can just as easily be validated by non-experts. In creating ROCStories, ? (?) employ several novel methods of crowdsourced validation of crowdsourced data. Involved validation is especially necessary for difficult crowdsourcing tasks like the authoring of ROCStories, which required crowd workers to write long texts following strict guidelines. Crowdsourced data validation typically just requires a separate group of workers to review generated data and identify any bad examples, e.g., the validation of DNC benchmark data (?). For labeling tasks, multiple crowd workers can label the same examples, and agreement can be measured from this to estimate data quality, e.g., in creating the JOCI benchmark (?).
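
To make the agreement measurement concrete, the following sketch computes Cohen's kappa for two workers who labeled the same examples. Cohen's kappa is one common agreement statistic and is used here as an assumption; the cited benchmarks may have measured agreement differently. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)

    # Fraction of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two crowd workers on four examples.
worker_1 = ["valid", "valid", "invalid", "invalid"]
worker_2 = ["valid", "valid", "invalid", "valid"]
kappa = cohens_kappa(worker_1, worker_2)
```

Values near 1 indicate strong agreement beyond chance, while values near 0 suggest that the labels agree no more often than random assignment would, which can flag unclear instructions or low-quality examples.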

For complex writing tasks where both data authoring and validation are highly involved, it may be advantageous to present crowdsourced tasks to two interacting workers at once. ? (?) do this to create CoQA from actual human conversations about provided passages on Amazon Mechanical Turk, and achieve high data quality with minimal validation. In creating the data, the two interacting workers validate each other’s work, and can even report workers who do not follow instructions, reducing the burden of worker qualification.

Dataset (Reference) | Data Size | Question | Answer | Alternatives | Annotation | Validation
RTE-1 (?) 1.37K M M M
RTE-2 (?) 1.60K M M M
RTE-3 (?) 1.60K M M M
RTE-4 (?) 1.00K M M M
RTE-5 (?) 1.20K M M M
RTE-6 (?) 32.7K T M M
RTE-7 (?) 48.8K T M, T M
COPA (?) 1.00K M M M M
SICK (?) 9.84K M, T C C
SNLI (?) 570K T, C C C C
CBT (?) 687K T, A A A
Triangle-COPA (?) 100 M M M M M
ROCStories (?) 98.2K C C C C
WSC (?) 60 M M M
bAbI (?) 40.0K A A
SQuAD 1.1 (?) 108K T, C C C
JOCI (?) 39.1K A, T C C
CLOTH (?) 99.4K T T T C
IIE (?) 313K T, A T, A M
SciTail (?) 27.0K T, C C C C
ARC (?) 7.79K T T T
MCScript (?) 13.9K M, T, C C C C C
SC (?) 161K T C C C
Event2Mind (?) 57.1K T C C
ProPara (?) 488 C C C C
MultiRC (?) 9.87K T, C C C C
SQuAD 2.0 (?) 151K T, C C C
CoQA (?) 8.40K T, C C C
GLUE (?) 1.44M T, A T, A M
DNC (?) 570K M, A, T M, A, T C
OpenBookQA (?) 5.96K C C C C C
SWAG (?) 114K T T, A A C C
ReCoRD (?) 121K A, T A A A A, C
CommonsenseQA (?) 9.40K C T T C
Table 2: Chronological summary of methods used in creating, annotating (i.e., providing extra useful information beyond the answer which the task evaluates upon), and validating selected commonsense benchmarks and tasks, where M refers to manual approaches by experts, A to automatic approaches through language generation, T to text mining, and C to crowdsourcing. Data size is included for comparison of methods used.

Cost optimization. Though hiring crowd workers is typically cheaper than hiring permanent workers, the cost may still be limiting, especially if the creator wishes to properly evaluate the quality of generated data through verification by even more crowd workers. For example, ROCStories (?), consisting of about 50,000 well-evaluated five-sentence stories and 13,500 test cases, cost an average of 26 cents per story and an extra 10 cents per test case, resulting in a total cost close to 15,000 USD to generate the dataset. If the cost of such thorough validation is an issue, validating a random sample of produced data, e.g., in validating SNLI (?), can serve as an indicator of the overall quality of benchmark data.
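
The cost figures quoted above can be checked with simple arithmetic, and sampling for validation is equally easy to sketch. The example below uses only the numbers stated in the text; the helper names `total_cost_usd` and `sample_for_validation`, the 5% sampling fraction, and the fixed random seed are hypothetical choices for illustration.

```python
import random


def total_cost_usd(n_stories, n_test_cases, cents_per_story=26, cents_per_test_case=10):
    """Estimate total crowdsourcing cost, working in integer cents to avoid rounding drift."""
    cents = n_stories * cents_per_story + n_test_cases * cents_per_test_case
    return cents / 100


def sample_for_validation(items, fraction, seed=0):
    """Draw a reproducible random subset of items to send for validation."""
    rng = random.Random(seed)
    k = max(1, round(len(items) * fraction))
    return rng.sample(items, k)


# Figures from the text: about 50,000 stories and 13,500 test cases.
estimate = total_cost_usd(50_000, 13_500)

# Validate a 5% sample of 1,000 hypothetical example IDs instead of all of them.
subset = sample_for_validation(list(range(1000)), 0.05)
```

Under these figures the estimate comes to 14,350 USD, consistent with the "close to 15,000 USD" total reported above.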

Ultimately, each method of data collection has its own advantages and drawbacks. While manual authoring results in high-quality, expert-verified data, it is slow and unscalable. Meanwhile, the quality of automatically authored data is highly dependent on the language model used, and though it is faster, it may require manual verification. If using text mining rather than generating from scratch, data is more likely to be representative of human language; however, manual verification may still be necessary depending on the source from which data are extracted. Lastly, crowdsourcing is a quick and convenient way to collect human-authored data following any set of criteria or restrictions, but it comes with special considerations that address the difficulty of work, qualification of workers, data validation, and cost optimization. When developing a new benchmark, the above trade-offs will need to be carefully considered.

3 Knowledge Resources

It is estimated that a typical human has accrued several million different axioms of commonsense by adulthood (?). The lack of this commonsense knowledge is one of the major bottlenecks in machine intelligence. In order to remove this bottleneck, decades of efforts have been made in developing various knowledge resources in the field of AI. The acquired knowledge is often represented in various forms such as propositions, taxonomies, ontologies, and semantic networks. In this section, we start with an introduction to several existing knowledge resources, and then discuss the main issues involved in building these resources.

3.1 An Overview of Knowledge Resources for NLU

To understand human language, it is important to have linguistic knowledge resources that allow computers to identify syntactic and semantic structures from language. These structures often need to be augmented with common knowledge and commonsense knowledge in order to reach a full understanding.

3.1.1 Linguistic Knowledge Resources

Linguistic resources have been pivotal in pushing the NLP field forward in the last thirty years. Resources have been developed where annotations for syntactic, semantic, and discourse structures are provided for training machine learning models. Several knowledge bases, particularly for lexical semantics, have also been made available to facilitate semantic processing.

Annotated linguistic corpora.

Widely used linguistic resources include the Penn Treebank (?) and several derivatives of it. The Penn Treebank is perhaps the first annotated corpus that drove the development of earlier machine learning approaches in the 1990s. It started with POS tags and syntactic structures based on context-free grammar. The Wall Street Journal portion of it was further augmented into PropBank (?), which provides the annotation of predicate-argument structures (?). The Penn Discourse Treebank (PDTB) is built upon these (?), adding annotated discourse structures. OntoNotes revises the information in the Wall Street Journal portion of the Penn Treebank and PropBank, integrating it with word sense, proper name, coreference, and ontological annotations, as well as including some Chinese linguistic annotations (?). The Abstract Meaning Representation (AMR) corpus extends PropBank into a sentence-level semantic formalism (?). All of these linguistic corpora can be downloaded by members of the Linguistic Data Consortium.

Lexical resources.

A widely used lexical resource for commonly used nouns, verbs, adjectives and adverbs is WordNet (?). Different from a traditional online dictionary, WordNet organizes words in terms of concepts (i.e., a list of synonyms) and their semantic relations to other words (e.g., antonymy, hyponymy/hypernymy, entailment, etc.). WordNet has been applied in many NLP applications which involve, for example, query expansion and similarity measures. There are also resources specifically for verbs. VerbNet by ? (?) is a hierarchical English verb lexicon that is created based on the verb classes from the English Verb Classes and Alternations (EVCA) resource by ? (?). VerbNet contains 280 classes of verbs and each class is described by argument structures, selectional restrictions on the arguments, and syntactic descriptions. FrameNet (?) provides a database of frame semantics for a set of verbs. It also comes with sentences that are annotated with the frame semantics. Other resources for verbs include VerbOcean, which captures a network of 3,500 unique verbs and 22,000 fine-grained relations between the verbs, and VerbCorner (?), which provides crowd-sourced validation for VerbNet.

3.1.2 Common Knowledge Resources

Common knowledge refers to specific facts about the world that are often explicitly stated, for example, "canine distemper is a domestic animal disease." (?). Though this is not the same as commonsense knowledge, it is often required to achieve a deep understanding of both the low- and high-level concepts found in language (?). In this section, we summarize several knowledge resources for common knowledge.


Wikipedia is a large and open source of common knowledge. Yet Another Great Ontology (YAGO) by ? (?) augments WordNet (?) with common knowledge facts extracted from Wikipedia, converting WordNet from a primarily linguistic resource to a common knowledge base. YAGO originally consisted of more than 1 million entities and 5 million facts describing relationships between these entities. YAGO2, which grounded entities, facts, and events in time and space, contained 446 million facts about 9.8 million entities (?), while YAGO3 added about 1 million more entities from non-English Wikipedia articles (?). YAGO is available for free download.


DBpedia by ? (?) is another Wikipedia-based knowledge base originally consisting of structured knowledge from more than 1.95 million Wikipedia articles. At its creation, DBpedia included around 103 million Resource Description Framework (RDF) triples, which are triples of subjects, predicates, and objects that describe semantic relationships. These triples included descriptions of concepts within articles, information about people, links between articles, and category labels from YAGO. The latest version, available for free, consists of 6.6 million entities, 5.5 million resources classified in the DBpedia ontology, and over 23 billion RDF triples.
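
As a rough illustration of the subject-predicate-object structure described above, the sketch below stores a few triples in memory and answers a simple lookup. It does not use DBpedia's actual identifiers or any RDF library: the `Triple` type, the `objects_of` helper, and all entity and predicate names are hypothetical.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One statement: a subject, a predicate, and an object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical triples written in the style of a common knowledge base.
triples = [
    Triple("East_Lansing", "locatedIn", "Michigan"),
    Triple("Michigan", "locatedIn", "United_States"),
    Triple("East_Lansing", "type", "City"),
]


def objects_of(store: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]


where = objects_of(triples, "East_Lansing", "locatedIn")
```

Real knowledge bases of this kind hold millions or billions of such statements and are typically queried through indexed stores rather than a linear scan, but the underlying data model is the same three-part statement.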


Yet another Wikipedia-based resource is WikiTaxonomy by ? (?), which consists of about 105,000 well-evaluated semantic links between categories in Wikipedia articles. Categories and relationships are labeled using the connectivity of the conceptual network formed by the categories. The authors demonstrate that this resource can be used to calculate semantic similarity of words, which may be useful in textual entailment or inference tasks. WikiTaxonomy is available for free download.


Freebase by ? (?) was a knowledge graph which originally contained 125 million RDF triples of general human knowledge about 4,000 types of entities and 7,000 properties of entities. This resource was later absorbed into the Google Knowledge Graph for intelligent web searching; however, the last release is still available for free download. It contains more than 1.9 billion triples.


As human knowledge is not static, it is advantageous for knowledge bases to grow over time. This is typically done through new releases; however, the Never-Ending Language Learner (NELL) by ? (?) continually grows by automatically mining structured beliefs of varying confidence from the web daily. It originally contained 242,000 beliefs about properties of entities, but now contains over 50 million beliefs, with almost 3 million of these having high confidence. This version can be downloaded for free.


Probase by ? (?) is different from previous common knowledge taxonomies in that relationships are probabilistic rather than concrete. Probase consists of 2.7 million concepts extracted from 1.6 billion web pages. Relationships between concepts are described in 20.8 million is-a and is-instance-of pairs, and probabilistic interpretation is possible through provided similarity values between 0 and 1 for each pair of concepts in the knowledge base. Though the original resource is no longer available, the Microsoft Concept Graph, which was built upon Probase, can be downloaded for free.

3.1.3 Commonsense Knowledge Resources

Commonsense knowledge, on the other hand, is considered obvious to most humans, and not so likely to be explicitly stated (?). ? (?) demonstrate this fact: "if you see a six-foot-tall person holding a two-foot-tall person in his arms, and you are told they are father and son, you do not have to ask which is which." There has been a long effort in capturing and encoding commonsense knowledge. Various knowledge bases have been developed. Here, we give a brief introduction to some of the well-known commonsense knowledge bases.

Note that as there is a fine line between commonsense knowledge and common knowledge, the knowledge bases we describe here may also contain common knowledge. Learning commonsense knowledge requires generalizations over common knowledge, so it is not uncommon for these types of knowledge to appear together in knowledge bases. We present several knowledge resources which focus on commonsense knowledge, but may also include common knowledge.


A well-known project toward encoding commonsense knowledge is Cyc by ? (?), a knowledge base of rules expressing ontological relationships between objects encoded in the CycL language. The types of objects in Cyc include entities, collections, functions, and truth functions. Cyc also includes a powerful inference engine. ResearchCyc, a release of Cyc for the research community, can be licensed for free. According to the project site, the latest release of ResearchCyc contains over 7 million commonsense assertions. More recently, there have been efforts to map Cyc to Wikipedia articles in an attempt to connect it to other resources such as DBpedia and Freebase (?, ?).


Another popular knowledge base for commonsense reasoning is ConceptNet from ? (?), a product of the Open Mind Common Sense project by ? (?), which collected free text commonsense assertions from online users. This semantic network originally contained over 1.6 million assertions of commonsense knowledge represented as links between 300,000 nodes representing entities, but subsequent releases have expanded it and added more features. The latest release, ConceptNet 5.5 (?), contains over 21 million links between over 8 million nodes, having been augmented by several additional resources including Cyc (?) and DBpedia (?). It includes knowledge from multilingual resources, and links to knowledge from other knowledge graphs. ConceptNet has been applied in several commonsense reasoning systems, some of which are described in Section 4.3. ConceptNet is open, and information for using or downloading it is available online.


Though not technically a knowledge base itself, AnalogySpace (?), another product of the Open Mind Common Sense project, is an algorithm for reducing the dimensionality of commonsense knowledge so that knowledge bases can be more efficiently and accurately reasoned over. Since knowledge in large knowledge bases can be noisy or subjective, it provides a way to make conclusions by analogy, i.e., through recognizing similarities and tendencies by simple vector operations. This projects concepts onto dimensions of goodness, difficulty, and so on, which may help models generalize on sparse benchmark data. AnalogySpace was originally applied to ConceptNet (?), and is included with the latest release of ConceptNet.


SenticNet by ? (?) was originally only a commonsense knowledge base, but later versions incorporated common knowledge as well (?). Though the knowledge base is intended for sentiment analysis, it may be useful in commonsense reasoning tasks which require inference about sentiment. SenticNet is available for free.


Isanette by ? (?) was a semantic network of both common and commonsense knowledge created by combining Probase (?) and ConceptNet 3 (?) into a set of "is a" relationships and confidences. This work was later cleaned and optimized into IsaCore (?), and demonstrated to be effective for sentiment analysis. IsaCore may also be a useful resource for commonsense reasoning. It is available for free download.


COGBASE by ? (?) uses a novel formalism to represent 2.7 million concepts and 10 million commonsense facts about them. It makes up the core of SenticNet 3 (?). Data from COGBASE can currently be accessed via an online interface and a demo API.


WebChild by ? (?) was originally a commonsense knowledge base of general noun-adjective relations extracted from Web content and other resources, consisting of about 78,000 distinct noun senses, 5,600 distinct adjective senses, and 4.6 million assertions between them. These assertions captured fine-grained relations among the noun and adjective senses. Unlike other resources collected from the Web, WebChild consists primarily of commonsense knowledge, as it consists of generalized, fine-grained relationships between nouns and adjectives collected from various corpora rather than structured common knowledge taken directly from a single open source like Wikipedia. WebChild 2.0 (?) was later released to include over 2 million concepts and activities, and over 18 million such assertions. Data from WebChild can be browsed and downloaded online.


? (?) claim that knowledge of which objects tend to be near each other (e.g., silverware, a plate, and a glass) is a type of commonsense knowledge lacking in previous knowledge bases like ConceptNet 5.5 (?). They refer to this property as LocatedNear, and to address this issue, they create two datasets which we refer to jointly as LocatedNear. The first consists of 5,000 sentences describing scenes of two objects labeled for whether the objects tend to occur near each other, which can serve as a commonsense task similar to those introduced in Section 2. The second consists of 500 pairs of objects with human-produced confidence scores for how likely the objects are to appear near each other. These resources are available for download.


The Atlas of Machine Commonsense (ATOMIC) by ? (?) is a knowledge graph consisting of about 300,000 nodes corresponding to short textual descriptions of events, and about 877,000 "if-event-then" triples representing if-then relationships between everyday events. Rather than taxonomic or ontological knowledge, this graph contains easily-accessed inferential knowledge. ? demonstrate that neural models can learn simple commonsense reasoning skills from ATOMIC which can be used to make inferences about previously unseen events. ATOMIC can be browsed and downloaded for free.

3.2 Approaches to Creating Knowledge Resources

Similar to creating commonsense reasoning benchmarks described in Section 2, various approaches have been applied to create knowledge resources. These approaches range from manual encoding by experts, to text mining from web documents, and to collection through crowdsourcing. A detailed description of these approaches is provided by ? (?). Here, we give a brief discussion about pros and cons of these approaches.

Manual encoding.

Early knowledge bases were often manually created. The classic example of this is Cyc, which is produced by knowledge engineers who hand-code commonsense knowledge into the CycL formalism (?). Since its first release in 1984, Cyc has been going through continuous development over the last 35 years. The cost of this manual encoding is high, with a total estimated cost of $120M (?). As a consequence, Cyc is small relative to other resources, and growing very slowly. On the other hand, this expert-based approach ensures high quality of data.

Text mining.

Text mining and information extraction tools such as TextRunner (?) and KnowItAll (?) are applied to automatically generate knowledge graphs and taxonomies from information sources. One popular information source is Wikipedia, which was drawn from in creating common knowledge bases such as YAGO (?), DBpedia (?), and WikiTaxonomy (?). Other knowledge bases are generated from crawling the Web, e.g., NELL (?), or even from other knowledge bases, e.g., IsaCore (?). One key advantage of text mining approaches is cost efficiency. According to ? (?), creating a statement in the Wikipedia-extracted DBpedia and YAGO costs 1.85¢ and 0.83¢, respectively (USD), which is hundreds of times less than the cost of manually encoding a statement in Cyc (which was estimated at about $5.71 per statement). This makes it easy for text mining approaches to scale up to create large knowledge bases. However, the drawback is that the acquired knowledge can be noisy and inconsistent, and may often need human validation.
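
The scale of this cost difference can be worked out directly from the per-statement figures quoted above. The short sketch below does only that arithmetic; the dictionary and variable names are illustrative.

```python
# Per-statement costs in US cents, as quoted in the text.
CYC_CENTS_PER_STATEMENT = 571.0  # about $5.71
MINED_CENTS_PER_STATEMENT = {
    "DBpedia": 1.85,
    "YAGO": 0.83,
}

# How many times more expensive a manually encoded Cyc statement is.
cost_ratio = {
    name: CYC_CENTS_PER_STATEMENT / cents
    for name, cents in MINED_CENTS_PER_STATEMENT.items()
}

for name, ratio in cost_ratio.items():
    print(f"Cyc costs roughly {ratio:.0f}x more per statement than {name}")
```

Under these figures, a manually encoded statement costs on the order of three hundred times more than a DBpedia statement and nearly seven hundred times more than a YAGO statement.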


Another highly popular approach to creating knowledge bases is crowdsourcing. The Open Mind Common Sense project responsible for producing ConceptNet (?) used a competitive online game to accept statements from humans in free text (?). Later, researchers converted the knowledge within collected statements into a knowledge graph by automatic processes. This method of using games to attract users to perform human intelligence tasks for free has been applied in creating other knowledge resources such as VerbCorner (?) and the Robot Trainer knowledge base by ? (?), where players must teach a virtual robot human knowledge. The cost of crowdsourcing effort is difficult to assess. It ranges from literally getting it for free (e.g., the Open Mind Common Sense platform where users who submitted data were unpaid) to an estimate of $2.25 per statement (?). Though the gaming approach may be cheaper in the long run, developing such a game platform is inevitably more time-consuming. Another challenge of the crowdsourcing approach, as pointed out by ? (?), is that naive crowd workers may not be able to follow the theories and representations of knowledge that engineers have worked out. As a result, knowledge acquired by crowdsourcing can be somewhat messy, which again often needs human expert validation.

Each of these methods has its own advantages and drawbacks in terms of the trade-offs between the cost and the quality of the acquired knowledge. Most of these knowledge resources are developed in a bottom-up fashion. The goal is to create general knowledge bases to provide inductive bias for a variety of learning and reasoning tasks. Nevertheless, it is not clear whether such a goal is met and to what extent these knowledge resources are applied to commonsense reasoning in practice. A systematic study, as suggested by ? (?), for Cyc and other resources would be useful.

4 Learning and Inference Approaches

To solve the benchmark tasks described in Section 2, a variety of approaches have been developed. These range from earlier symbolic and statistical approaches to recent approaches that apply deep learning and neural networks. This section gives a brief overview of some representative approaches.

4.1 Symbolic and Statistical Approaches

Manually authored logic rules and formalisms have been demonstrated to perform well for various reasoning tasks. ? (?) provides a more detailed review of these approaches in commonsense reasoning, but we introduce a few which have been applied to the surveyed commonsense benchmarks. Manually authored logical rules were applied, for example, in systems for the earlier RTE Challenges (?). They were also applied in more recent work such as the baseline approach to the Triangle-COPA benchmark which achieved 91% accuracy (?). The highest-performing system (?) in the 3-way task of the fourth RTE challenge (?) used both manually authored logical rules and outside knowledge from Wikipedia, WordNet (?), and VerbOcean (?). While manually authored logical rules have been demonstrated to be highly effective in some tasks (?), this approach is not scalable for more complex tasks and reasoning.

Statistical approaches often rely on engineered features to train statistical models for various tasks. Lexical features based on, for example, bag of words and word matching were commonly used in the earlier RTE Challenges (?, ?), but often achieved results only slightly better than random guessing (?). More competitive systems have used more linguistic features to make predictions, such as semantic dependencies and paraphrases (?), synonym, antonym, and hypernym relationships derived from training data, and hidden correlation biases in benchmark data ? (?).
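
A word-matching feature of the kind mentioned above can be sketched as follows. This is a minimal illustration and not a reconstruction of any particular RTE system: the function names, the whitespace tokenization, and the 0.75 decision threshold are all assumptions.

```python
def word_overlap(text, hypothesis):
    """Fraction of hypothesis tokens that also appear in the text."""
    text_words = set(text.lower().split())
    hypothesis_words = hypothesis.lower().split()
    if not hypothesis_words:
        return 0.0
    matched = sum(1 for word in hypothesis_words if word in text_words)
    return matched / len(hypothesis_words)


def predict_entailment(text, hypothesis, threshold=0.75):
    """Predict entailment when enough of the hypothesis is covered by the text.

    The threshold is an arbitrary illustrative value; in practice it would
    be tuned on training data or replaced by a trained classifier.
    """
    return word_overlap(text, hypothesis) >= threshold


premise = "a man is playing a guitar on stage"
covered = word_overlap(premise, "a man is playing a guitar")
uncovered = word_overlap(premise, "a woman is sleeping")
```

Because such a feature ignores word order, negation, and meaning, it is easy to see why systems relying on it alone performed only slightly better than chance.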

External knowledge and the Web are often used to complement features derived from the training data. For example, the best system in the first RTE Challenge (?) used a naïve Bayes classifier with features from the co-occurrences of words from an online search engine (?). A similar approach was also applied in the top system from the seventh RTE Challenge (?), which utilized knowledge resources from Section 3, acronyms extracted from the training data, and linguistic knowledge to calculate a statistical measure of entailment between sentences (?). While the use of some external knowledge provides an advantage over models which only use linguistic features extracted from training data, statistical models have still not been competitive in recent benchmarks of large data size. Nonetheless, such models may serve as useful baselines for new benchmarks, as demonstrated for JOCI (?).

4.2 Neural Approaches

Figure 12: Common components in neural approaches to language-based commonsense tasks.

The increasingly large amount of data available for recent benchmarks makes it possible to train neural models. These approaches often top various leaderboards. Figure 12 shows some common components in neural models. First of all, distributional representation of words is fundamental, where word vectors or embeddings are usually trained using neural networks on large-scale text corpora. In traditional word embedding models like word2vec (?) or GloVe (?), the embedding vectors are context independent. No matter what context the target word appears in, once trained, its embedding vector is always the same. Consequently, these embeddings lack the capability of modeling different word senses in different contexts, although this phenomenon is prevalent in language. To address this problem, recent work has developed contextual word representation models, e.g., Embeddings from Language Models (ELMo) by ? (?) and Bidirectional Encoder Representations from Transformers (BERT) by ? (?). These models give words different embedding vectors based on the context in which they appear. These pre-trained word representations can be used as features or fine-tuned for downstream tasks. For example, the Generative Pre-trained Transformer (GPT) by ? (?) and BERT (?) introduce minimal task-specific parameters, and can be easily fine-tuned on the downstream tasks with modified final layers and loss functions.
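The difference between context-independent and contextual embeddings can be illustrated with a toy sketch. The two-dimensional vectors and the function `toy_contextual_embed` are hypothetical; the latter simply mixes each word vector with the sentence mean and is only a stand-in for the idea, not how ELMo or BERT compute representations.

```python
# Hypothetical static embedding table (values chosen only for illustration).
STATIC = {"river": [1.0, 0.0], "bank": [0.5, 0.5], "money": [0.0, 1.0]}


def static_embed(tokens):
    """Context-independent lookup: a word always maps to the same vector."""
    return [STATIC[t] for t in tokens]


def toy_contextual_embed(tokens):
    """Toy stand-in for a contextual model: mix each vector with the sentence mean."""
    vecs = static_embed(tokens)
    dim = len(vecs[0])
    mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return [[(v[d] + mean[d]) / 2 for d in range(dim)] for v in vecs]


same = static_embed(["river", "bank"])[1] == static_embed(["money", "bank"])[1]
differs = toy_contextual_embed(["river", "bank"])[1] != toy_contextual_embed(["money", "bank"])[1]
```

The static lookup returns an identical vector for "bank" in both inputs, while the toy contextual version returns different vectors depending on the neighboring word.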

On top of the word embedding layers, task-specific network architectures are designed for different downstream applications. These architectures often adopt recurrent neural networks (RNNs, e.g., LSTM and GRU), convolutional neural networks (CNNs), or more recently, transformers to solve specific tasks. The output layers of the networks are chosen based on the task formulation, e.g., a linear layer plus softmax for classification, or a language decoder for language generation. Because of the sequential nature of language, RNN-based architectures are widely applied and are often implemented in both baseline approaches (?, ?) and the current state-of-the-art approaches (?, ?, ?). Given different architectures, neural models also benefit from techniques like memory augmentation and attention mechanisms. For tasks that require reasoning based on multiple supporting facts, e.g., bAbI (?), memory-augmented networks like memory networks (?) and recurrent entity networks (?) have been shown to be effective. For tasks that require alignment between input and output, e.g., textual entailment tasks like SNLI (?), or capturing long-term dependencies, it is often beneficial to add attention mechanisms to models.
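As an illustration of the classification output layer mentioned above, here is a minimal sketch of a linear layer followed by softmax. The weights are made-up values, and `classify` is a hypothetical helper rather than part of any surveyed system.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases):
    """Linear layer followed by softmax; `weights` has one row per class."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Three classes, e.g., entailment / neutral / contradiction (illustrative only).
probs = classify([1.0, 2.0], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0])
```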

Next, we give examples of current state-of-the-art systems, particularly focusing on three aspects: memory augmentation, attention mechanism, and pre-trained models and representations.

4.2.1 Memory Augmentation

As mentioned earlier, a popular type of approach for tasks which require comprehending passages with several state changes or supporting facts, such as bAbI (?) or ProPara (?), involves augmenting systems with a dynamic memory which may be maintained over time to represent the changing state of the world. We discuss memory networks (?), recurrent entity networks (?), and the recent Knowledge Graph-Machine Reading Comprehension (KG-MRC) system (?) to highlight key characteristics of such an approach.

Memory networks.

Memory networks by ? (?), introduced as high-performing baseline approaches to both bAbI (?) and CBT (?), track the world state by adding a long-term memory component to the typical network architecture. A memory network consists of a memory array, an input feature map, a generalization module which updates the memory array given new input, an output feature map, and a response module which converts output to the appropriate response or action. The networks can take characters, words, or sentences as input. Each component of the network can take different forms, but a common implementation is for them to be neural networks, in which case the network is called a MemNN.
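The four components and the memory array described above can be sketched as follows. This is a deliberately simple, non-neural toy: the class `ToyMemoryNetwork`, its bag-of-words features, and its overlap-based retrieval are assumptions made for illustration, whereas a MemNN would learn each component.

```python
class ToyMemoryNetwork:
    """Non-neural toy with the same component layout as a memory network."""

    def __init__(self):
        self.memory = []  # memory array

    def input_map(self, text):
        """Input feature map: text -> bag-of-words feature set."""
        return set(text.lower().rstrip(".?").split())

    def generalize(self, text):
        """Generalization module: update the memory array given new input."""
        self.memory.append((self.input_map(text), text))

    def output_map(self, question):
        """Output feature map: select the memory slot that best matches the question."""
        q = self.input_map(question)
        return max(self.memory, key=lambda slot: len(slot[0] & q))

    def respond(self, question):
        """Response module: convert the selected output into a response."""
        return self.output_map(question)[1]


net = ToyMemoryNetwork()
net.generalize("Mary went to the kitchen.")
net.generalize("John went to the garden.")
answer = net.respond("Where is John?")
```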

The ability to maintain a long-term memory provides more involved tracking of the world state and context. On the CBT cloze task (?), memory networks are shown to outperform primarily RNN- and LSTM-based approaches in predicting missing named entities and common nouns, because memory networks can efficiently leverage a wider context than these approaches in making inferences. When tested on bAbI, MemNNs also achieved high performance and outperformed LSTM baselines, and on some tasks were able to achieve high performance with fewer training examples than provided (?).

Recurrent entity networks.

The recurrent entity network (EntNet) by ? (?) is composed of several dynamic memory cells, where each cell learns to represent the state or properties concerning entities mentioned in the input. Each cell is a gated RNN which only updates its content when new information relevant to the particular entity is received. Further, EntNet’s memory cells run in parallel, allowing multiple locations of memory to be updated at the same time.
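The gated, per-cell update can be sketched in a simplified form. The sketch below assumes each cell has a key vector and that the gate is a sigmoid of the input's similarity to the cell's key and content; it drops the learned transformation matrices and normalization of the full model, so `update_cells` is an illustrative simplification rather than the EntNet update itself.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def update_cells(cells, keys, s):
    """Apply one gated update to every memory cell for an encoded input `s`.

    Each cell is updated independently of the others, so the loop could run
    in parallel. The candidate content is simply `s` (a simplification).
    """
    new_cells = []
    for h, w in zip(cells, keys):
        gate = sigmoid(dot(s, h) + dot(s, w))  # high only if s is relevant to this cell
        new_cells.append([hi + gate * si for hi, si in zip(h, s)])
    return new_cells


# Two cells with hypothetical keys; the input matches the first key only.
cells = update_cells([[0.0, 0.0], [0.0, 0.0]], [[4.0, -4.0], [-4.0, 4.0]], [1.0, 0.0])
```

With these made-up values the first cell absorbs almost all of the input while the second is left nearly unchanged.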

EntNet is, to our knowledge, the first model to pass all twenty tasks in bAbI (?), and achieves impressive results on CBT (?), outperforming memory network baselines on both benchmarks. EntNet is also used as a baseline in the Story Commonsense benchmark (?) in an attempt to track the motivations and emotions of characters in stories from ROCStories (?) with some success. An advantage of EntNet is that it maintains and updates the state of the world as it reads the text, unlike memory networks, which can only perform reasoning once the entire supporting text and the question are processed and loaded into the memory. For example, given a supporting text with multiple questions, EntNet does not need to process the input text multiple times to answer these questions, while memory networks need to re-process the whole input for each question.

While EntNet achieves state-of-the-art performance on bAbI, it does not perform so well on ProPara (?), another benchmark which requires tracking the world state. According to ? (?), a drawback of EntNet is that while it maintains memory registers for entities, it has no separate embedding for individual states of entities over time. They further explain that EntNet does not explicitly update coreferences in memory, which can certainly cause errors when reading human-authored text which is rich in coreference, as opposed to the simply-structured, automatically-generated bAbI data.


KG-MRC.

A more recent model, the Knowledge Graph-Machine Reading Comprehension (KG-MRC) system from ? (?), maintains a dynamic memory similar to that of memory networks. However, this memory is in the form of knowledge graphs generated after every sentence of procedural text, leveraging research efforts from the area of information extraction. Generated knowledge graphs are bipartite, connecting entities in the paragraph with their locations (currently, only the location relation is captured). Connections between entities and locations are updated to generate a new graph after each sentence.
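The per-sentence graph update can be sketched as a data structure. In this sketch the bipartite graph is a dictionary from entity to location, and the entity-location updates for each sentence are assumed to be given already; in KG-MRC they would come from a trained reading comprehension model, which is not modeled here. The function `build_graphs` and the example entities are hypothetical.

```python
def build_graphs(updates_per_sentence):
    """Return one entity -> location graph per sentence.

    `updates_per_sentence` is a list of dicts, each holding the location
    changes extracted from one sentence (assumed to be given).
    """
    graphs, current = [], {}
    for updates in updates_per_sentence:
        current = {**current, **updates}  # new graph; earlier snapshots stay intact
        graphs.append(current)
    return graphs


# Three sentences of a procedural text: water is in the soil, nothing moves,
# then water moves to the root.
graphs = build_graphs([{"water": "soil"}, {}, {"water": "root"}])
```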

According to the official ProPara leaderboard, the KG-MRC system from ? (?) achieves the highest accuracy on the benchmark, reported as 47.0% in the paper. It provides advantages over ProStruct, the previous state of the art for ProPara (?). While ProStruct manually enforces hard and soft commonsense constraints, further investigation shows that KG-MRC learns these constraints automatically, violating them less often than ProStruct ? (?). This shows that the use of the recurrent graph representation helps the model learn these constraints, perhaps better than can be manually enforced. Further, since KG-MRC includes a trained reading comprehension model, it can likely better track changes in coreference which often occur in these texts.

4.2.2 Attention Mechanism

Since the first application of an attention mechanism to neural machine translation (?), attention has been used widely in NLP tasks, especially to capture the alignment between an input (encoder) and an output (decoder). Modeling attention has several advantages. It allows the decoder to directly go to and focus on certain parts of the input. It alleviates the vanishing gradient problem by providing a way to account for states far away in the input sequence. Another advantage is that the attention distribution learned by the model automatically provides an alignment between inputs and outputs which allows some understanding of their relations. Because of these advantages, attention mechanisms have been successfully applied to commonsense benchmark tasks.
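A minimal sketch of how a decoder state can attend over encoder states is given below. It uses dot-product scoring as a simplifying assumption (other scoring functions, such as additive scoring, are also used in practice), and `attend` is a hypothetical helper with made-up vectors.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return (attention weights, context vector) for one decoder query."""
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)  # the attention distribution doubles as a soft alignment
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


weights, context = attend([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]])
```

The weights sum to one and concentrate on the first encoder state, which is the one most similar to the query.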

Attention in RNN/CNN.

Adding an attention mechanism to RNNs, LSTMs, CNNs, and more has been shown to improve performance on various tasks compared to their vanilla models (?). It is particularly successful for tasks which require alignment between input and output, such as various RTE tasks which require the modeling of context and hypothesis, and reading comprehension questions which refer directly to an accompanying passage, like MCScript (?).

For example, the official leaderboard for the SNLI task (?) reports that the best performing system (?), partly inspired by DenseNet (?), uses a densely connected RNN while concatenating features from an attention mechanism to recurrent features in the network. As discussed by ? (?), the attentive weights resulting from this alignment help the system make accurate entailment and contradiction decisions for highly similar pairs of sentences. One such example given is the context sentence "Several men in front of a white building" compared to the hypothesis sentence "Several people in front of a gray building." For the MCScript task ? (?) used in SemEval 2018, online results indicate that the best performer (?) achieved an accuracy of 84.13% using a bidirectional LSTM-based approach with an attention layer.

RNNs with attention also have limitations, particularly when the alignment between inputs and outputs is not straightforward. For example, ? (?) found that yes/no questions were particularly challenging in MCScript, as they require special handling of negation and a deeper understanding of the question. Further, the multiple choices to answer questions in MCScript are human-authored instead of extracted directly from the accompanying passage as in other QA benchmarks, which caused some difficulties in connecting words back to the passage that stemming alone could not fix.

Self-attention in transformers.

In lieu of adding attention mechanisms to a typical neural model such as an RNN, LSTM, or CNN, the recently proposed transformer architecture is composed entirely of attention mechanisms (?). One key difference is the self-attention layer in both the encoder and the decoder. For each word position in an input sequence, self-attention allows it to attend to all positions in the sequence to better encode the word. It provides a method to potentially capture long-range dependencies between words, such as syntactic, semantic, and coreference relations. Furthermore, instead of performing a single attention function, the transformer performs multi-head attention in the sense that it applies the attention function multiple times with different linear projections, and therefore allows the model to jointly capture different attentions from different subspaces, e.g., jointly attend to information that might indicate both coreference and syntactic relations.
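For reference, the attention function and its multi-head extension are commonly written as below. The symbols are the standard notation of the transformer literature rather than notation defined in this survey: Q, K, and V are the query, key, and value matrices, d_k is the key dimension, and the W matrices are learned linear projections. In self-attention, Q, K, and V are all derived from the same input sequence.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
```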

Another big benefit of the transformer is its suitability for parallel computing. Sequence models such as RNNs and LSTMs are difficult to parallelize because of their sequential nature. The transformer, which uses attention to capture global dependencies between inputs and outputs, maximizes the amount of parallelizable computation. Empirical results on NLP tasks such as machine translation and constituency parsing have shown impressive performance gains with a significant reduction in training costs (?). Transformers have recently been used in pre-trained contextual models like GPT (?) and BERT (?) to achieve state-of-the-art performance on many commonsense benchmarks.

4.2.3 Pre-Trained Models and Representations

One of the most exciting recent advances in NLP is the development of pre-trained models and embeddings that can be used as features or further fine-tuned for downstream tasks. These models are often trained based on a large amount of unsupervised textual data. The earlier pre-trained word embedding models such as word2vec (?) and GloVe (?) have been widely applied. However, these models are context independent, meaning the same embedding is used in different contexts, and they therefore cannot capture different word senses. More recent work has addressed this problem by pre-training models that can provide a word embedding based on context. The most representative models are ELMo, GPT, and BERT. Next, we give a brief overview of these models and summarize their performance on the selected benchmark tasks.


ELMo.

ELMo’s characteristic contribution is its contextual word embeddings, which each rely on the entire input sentence they belong to ? (?). These embeddings are calculated from learned weights in a bidirectional LSTM which is pre-trained on the supervised One Billion Word language modeling benchmark (?). Simply adding these embeddings to the input features of previous state-of-the-art systems improved performance, suggesting that they are indeed successful in representing word context. An investigation by the authors shows that ELMo embeddings make it possible to identify word sense and POS, further supporting this.

When the ELMo embedding system was originally released, it helped exceed the state of the art on several benchmarks in the areas of question answering, textual entailment, and sentiment analysis. These included SQuAD (?) and SNLI (?). All of these approaches have since been exceeded. ELMo still commonly appears in baseline approaches to benchmarks, e.g., to SWAG (?) and CommonsenseQA (?). It is often combined with the enhanced LSTM-based models such as the ESIM model (?), or the CNN- and bidirectional GRU-based models such as the DocQA model (?).


GPT.

GPT by ? (?) uses the transformer architecture originally proposed by ? (?), in particular the decoder. This system is pre-trained in an unsupervised manner on a large amount of open online data, then fine-tuned on various benchmark datasets. Unlike ELMo, GPT learns its contextual embeddings in an unsupervised setting, which allows it to learn features of language without restrictions. The creators found that this technique produced more discriminative features than supervised pre-training when applied to a large amount of clean data. The transformer architecture itself can then be easily fine-tuned in a supervised setting for downstream tasks. ELMo is less suitable for this, and is instead better used to provide input features to a separate task-specific model.

When GPT was first released, it pushed the state of the art forward on 12 benchmarks in textual entailment, semantic similarity, sentiment analysis, commonsense reasoning, and more. These included SNLI (?), MultiNLI (?), SciTail (?), the Story Cloze Test (?), COPA (?), and GLUE (?). To our knowledge, GPT is still the highest-performing documented system for the Story Cloze Test and COPA, achieving 86.5% and 78.6% accuracy, respectively. GPT holds high positions on several other commonsense benchmark leaderboards as well. It is also commonly used as a baseline for new benchmarks, e.g., CommonsenseQA (?).

? (?) identify several limitations of the model. First, the model has high computational requirements, which is undesirable for obvious reasons. Second, data from the Internet, which the model is pre-trained on, are incomplete and sometimes inaccurate. Lastly, like many deep learning NLP models, GPT shows some issues with generalizing over data with high lexical variation.

To improve the generalization ability and develop upon the use of unsupervised training settings, the larger GPT 2.0 was recently released (?), which is highly similar to the original implementation, but with significantly more parameters and formulated as a language model. The expanded model has achieved new state-of-the-art results on several language modeling tasks, including CBT (?) and the 2016 Winograd Schema Challenge (?). Further, it exceeds three out of four baseline approaches to CoQA (?) in an unsupervised setting, i.e., trained only on documents and questions, not answers. In a supervised training setting, the model would be fed the answers directly with the questions so that model parameters could be updated based upon correlations between them. The model instead learns to perform tasks by observing natural language demonstrations of them without being told where the questions and answers are. This way, it is ensured that the model is not overfitting to superficial correlations between questions and answers. A qualitative investigation into model predictions suggests that some heuristics are indeed being learned to answer questions. For example, if asked a "who" question, the model has learned to return the name of a person mentioned in the passage that the question is posed on. This provides some evidence of the model performing reasoning processes similar to what the benchmarks intend.


BERT.

Recently, the BERT model (?) exceeded the state-of-the-art accuracy on several benchmarks, including GLUE (?), SQuAD 1.1 (?), and SWAG (?). According to the GLUE leaderboard, BERT originally achieved an overall accuracy of 80.4% on the multi-task benchmark. Meanwhile, according to the SQuAD leaderboard, BERT solved SQuAD 1.1 with 87.433% exact-match accuracy, exceeding human performance by 5.13 percentage points, and according to the SWAG leaderboard, it solved SWAG with 86.28% accuracy, greatly exceeding the previous state of the art set by GPT (?).

After its initial release, BERT further topped several new leaderboards such as OpenBookQA (?), CLOTH (?), SQuAD 2.0 (?), CoQA (?), ReCoRD (?), and SciTail (?). The deeper BERT-Large model introduced alongside the base model topped the leaderboards of OpenBookQA (?) and CLOTH (?), achieving accuracies of 60.40% and 86.0%, respectively. According to the SQuAD leaderboard, various implementations of BERT had beaten the state-of-the-art performance on SQuAD 2.0 more than a dozen times at the time of writing, the highest performance coming from an updated implementation which achieved an F-measure of 89.147. A modified ensemble implementation of BERT tops the CoQA leaderboard with an accuracy of 86.8%, and a single-model implementation tops the ReCoRD leaderboard with an accuracy of 74.76%. Most recently, a new implementation of BERT with updated loss functions is topping the GLUE leaderboard with 83.3% accuracy, beating the original implementation. Most of this progress came in a span of just a few months.

Benchmark  Simple Baseline  Best Baseline  ELMo  GPT  BERT  BigBird  Human
SQuAD 1.1 1.3 40.4 81.0 87.4 82.3
SWAG 25.0 59.2 78.0 86.3 88.0
SciTail 60.4 79.6 88.3 94.1
ReCoRD 18.6 45.4 73.0 91.3
OpenBookQA 25.0 50.2 60.4 91.7
CLOTH 25.0 70.7 86.0 85.9
GLUE 60.3 72.8 83.3 83.1 87.1
SQuAD 2.0 48.9 63.4 86.7 86.9
CoQA 1.3 65.1 86.8 88.8
SNLI 33.3 77.8 89.3 89.9 91.1
Table 3: Comparison of exact-match accuracy achieved on various benchmarks by a random or majority-choice baseline, the best-performing baseline presented in the original paper for each benchmark, ELMo, GPT, BERT, BigBird, and humans. ELMo refers to the highest-performing listed approach using ELMo embeddings. Best system performance on each benchmark in bold. Information extracted from leaderboards linked to in Section 4.2 at time of writing (March 2019), and original papers for benchmarks introduced in Section 2.

BERT provides several advantages over past state-of-the-art systems with pre-trained contextual embeddings. First, it is pre-trained on larger data than previous competitive systems like GPT (?). Where GPT is pre-trained with a large text corpus, BERT is trained with two larger corpora of passages on two tasks: a cloze task where input tokens are randomly masked, and a sentence ordering task where given two sentences, the system must predict whether the second sentence could come directly after the first. This sort of transfer learning from large-scale supervised tasks has been demonstrated several times to be effective in NLP problems. Similar to GPT, BERT was further fine-tuned for each of the 11 commonsense benchmark tasks it originally attempted.
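The cloze-style pre-training task can be sketched as follows. This is a simplified illustration: the function `mask_tokens`, the default masking probability, and the `[MASK]` placeholder string are assumptions for the sketch, and the full BERT procedure includes further details that are not modeled here.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with a mask symbol and record what was hidden.

    Returns the masked sequence and a dict {position: original token} that a
    model would be trained to predict.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
```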

Second, BERT uses a bidirectional form of the transformer architecture (?) for pre-training contextual embeddings. This better captures context, an advantage that previous competitive approaches like GPT (?) and ELMo (?) do not have. Instead, GPT uses a left-to-right transformer, while ELMo uses a concatenation of left-to-right and right-to-left LSTMs.

Lastly, BERT’s input embeddings are superior at representing context. Besides a traditional embedding for each token, they also include a learned embedding for each sentence (i.e., if a pair of sentences is the input) applied to each token, and learned positional embeddings for tokens. Consequently, its embedding can capture each unique word and its context, perhaps in a more sophisticated way than previous systems. Further, it can uniquely represent a sentence or pair of sentences, which is advantageous for solving a wide variety of language processing tasks in question answering, textual entailment, and more.
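The composition of the three embeddings can be sketched as an element-wise sum per token. The tiny tables below hold made-up values and `input_embeddings` is a hypothetical helper; in the actual model these tables are learned and much larger.

```python
def input_embeddings(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """Sum token, sentence (segment), and position embeddings for each input token."""
    return [
        [t + s + p for t, s, p in zip(tok_table[tid], seg_table[sid], pos_table[pos])]
        for pos, (tid, sid) in enumerate(zip(token_ids, segment_ids))
    ]


tok_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}   # one vector per vocabulary item
seg_table = {0: [0.1, 0.1], 1: [0.2, 0.2]}   # first vs. second sentence
pos_table = [[0.0, 0.0], [0.01, 0.01]]       # one vector per position

# The same token (id 0) appears once in each sentence of a pair.
out = input_embeddings([0, 0], [0, 1], tok_table, seg_table, pos_table)
```

The two occurrences of the same token receive different input vectors because their sentence and position embeddings differ.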

Recently, another BERT-based system called BigBird or the Multi-Task Deep Neural Network (MT-DNN) by ? (?) is achieving competitive performance on several leaderboards, such as SciTail with an accuracy of 94.07%, SNLI (?) with an accuracy of 91.1%, and GLUE with an accuracy of 83.1%, beating the original implementation of BERT. It appears that the performance gain from the BigBird model can mostly be attributed to adding multi-task learning during fine-tuning procedures. This is done through task-specific layers which generate representations for specific tasks, e.g., text similarity and sentence pair classification. While pre-training the bi-directional transformer helps BERT learn universal word representations which are applicable across several tasks, multi-task learning prevents the model from overfitting to particular tasks during fine-tuning, thus allowing it to leverage more cross-task data.

BERT and its variations are currently the state of the art on nearly all commonsense benchmark tasks, even exceeding human performance in some cases. According to ? (?), a goal of future work will be to determine whether BERT truly captures the intended semantic phenomena in benchmark datasets. Further, as the BigBird model was able to improve performance on several benchmarks by adding task-specific layers and enabling multi-task learning, BERT may be a bit too task-invariant, potentially missing helpful information that comes from differentiating between specific tasks. It could likely benefit from further investigation into multi-task learning approaches such as BigBird. Table 3 compares performance from ELMo, GPT, BERT, and BigBird on various benchmarks.

When to fine-tune.

These new pre-trained contextual models are applied to benchmark tasks in different ways. Particularly, while ELMo has been traditionally used to generate input features for a separate task-specific model, BERT-based models are typically fine-tuned on various tasks and applied to them directly. Understanding why these choices were made is important for the further development of these models. ? (?) investigate this difference in training the two models and compare their performance when their output is extracted as features for another model against their performance when they are fine-tuned and used directly on various tasks. Their results show that ELMo’s LSTM architecture can actually be fine-tuned and applied directly to downstream tasks like BERT can with some success, although it is more difficult to perform this fine-tuning on ELMo. Further, performance on sentence pair classification tasks like MultiNLI (?) and SICK (?) is shown to be better when the contextual embeddings generated by ELMo are instead used as input features to a separate task-specific architecture. They infer that this may be because the LSTM architecture of ELMo must consider tokens sequentially, rather than being able to compare all tokens to each other across sentence pairs like BERT’s transformer architecture can. BERT’s output can also be used as features for a task-specific model with some success, and it actually outperforms ELMo in most of the studied tasks when used in this way, likely for the same reason. It is important to note, however, that performance on sentence similarity tasks like the Microsoft Research Paraphrase Corpus (?) is significantly better when the model is fine-tuned to the task.

4.3 Incorporating External Knowledge

Most recent approaches have relied on the benchmark datasets (e.g., training data) to build models for reasoning and inference. Despite the availability of knowledge resources discussed in Section 3, few of them were actually applied to solve the benchmark tasks. WordNet (?) is perhaps the most widely applied lexical resource, and its word relations are particularly useful for textual entailment problems. WordNet has appeared in earlier approaches throughout the RTE Challenges (?, ?, ?, ?, ?, ?), and was more recently used by a competitive approach to the 2016 Winograd Schema Challenge (?, ?). Well-known and popular common knowledge resources such as DBpedia (?) and YAGO (?) have been used in creating benchmarks (?, ?), but have not been directly applied to solve the benchmark tasks.

ConceptNet (?) and Cyc (?) are by far the most talked-about commonsense knowledge resources available. However, Cyc does not seem to appear in any of our surveyed approaches, while ConceptNet has been occasionally applied. For example, OpenBookQA (?) uses ConceptNet in a neural baseline approach, which often retrieves additional common and commonsense knowledge facts not included in benchmark data. ConceptNet is most often used to create knowledge-enhanced word embeddings. A neural model applied to COPA leverages commonsense knowledge from ConceptNet (?) through ConceptNet-based embeddings which were generated by applying the word2vec skip-gram model (?) to commonsense knowledge tuples in ConceptNet (?). A recent approach to the Winograd Schema Challenge from (?) uses a similar technique, as does a baseline approach to SWAG (?). Though ConceptNet is demonstrated to be useful for such purposes, still very few state-of-the-art approaches actually use it; rather, they gain commonsense knowledge relations through benchmark training data only. This brings up some important questions as to how to incorporate external knowledge in modern neural approaches and how to acquire relevant external knowledge for the tasks at hand.
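One way to picture applying a skip-gram model to knowledge tuples is to treat each tuple as a very short sentence and generate (center, context) training pairs from it. This is an assumption about the general idea, not a description of how the cited systems were built; `skipgram_pairs` and the example tuple are hypothetical.

```python
def skipgram_pairs(tuples, window=2):
    """Generate (center, context) pairs by treating each tuple as a short sentence."""
    pairs = []
    for triple in tuples:
        for i, center in enumerate(triple):
            lo = max(0, i - window)
            hi = min(len(triple), i + window + 1)
            pairs.extend((center, triple[j]) for j in range(lo, hi) if j != i)
    return pairs


# Illustrative tuple in (head, relation, tail) form.
pairs = skipgram_pairs([("dog", "IsA", "animal")])
```

The resulting pairs would then be fed to a skip-gram trainer so that concepts linked in the knowledge base end up with related embeddings.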

5 Other Related Benchmarks

While this paper intends to cover language understanding tasks for which some external knowledge or advanced reasoning beyond linguistic context is required, many related benchmarks have not been covered. First of all, nearly all language understanding benchmarks developed throughout the last couple decades could benefit from commonsense knowledge and reasoning. Second, as language communication is integral to other perception and reasoning systems, recent years have also seen an increasing number of benchmark tasks that combine language and vision.

Language-related tasks.

Many early corpora for classical NLP tasks such as semantic role labeling, relation extraction, and paraphrase may also require commonsense knowledge and reasoning, though this was not emphasized or investigated at the time. For example, in creating the Microsoft Research Paraphrase Corpus, ? (?) found that the task of annotating text pairs was difficult to streamline because this often required commonsense, suggesting that the paraphrases within the corpus require commonsense knowledge and reasoning to identify. Some such tasks are actually included in the multi-task benchmarks like Inference is Everything (?), GLUE (?), and DNC (?). Other examples of related textual benchmarks include QuAC (?), which relies on conversation discourse for robust contextual question answering, but does not require commonsense in the way that the similar CoQA benchmark does (?), and a conversation dataset created by ? (?) which explores the use of commonsense knowledge from ConceptNet in producing higher quality and more relevant responses for chatbots.

Besides English, there are also benchmarks in other languages. For example, there have been RTE datasets created in Italian and Portuguese, and cross-lingual RTE datasets have appeared in several SemEval shared tasks over the years (?, ?, ?) to encourage progress in machine translation and content synchronization. There also exist various cross-lingual knowledge resources, including the latest version of ConceptNet (?), which contains relations from several multilingual resources.

Visual benchmarks.

Commonsense plays an important role in integrating language and vision, for example, grounding language to perception (?), language-based justification for action recognition (?), and visual question answering (?). Visual commonsense benchmarks include VQA benchmarks like the original VQA (?), other similar VQA datasets like Visual7W (?), and similar datasets with synthetic images, such as CLEVR (?) and the work by ? (?). They also include the tasks of commonsense action recognition and justification, which are found in the dataset by ? (?), and Visual Commonsense Reasoning (VCR) by ? (?). These are all image-based, but we are also beginning to see similar video-based datasets, such as Something Something (?), which aims to evaluate visual commonsense through over 100,000 videos portraying everyday actions. We have also seen vision-and-language navigation (VLN) tasks such as Room-to-Room (R2R) by ? (?). Such benchmarks are important to promote progress in physically grounded commonsense knowledge and reasoning.

6 Discussion and Conclusion

The availability of data and computing resources and the rise of new learning and inference methods make this an unprecedentedly exciting time for research on commonsense reasoning and natural language understanding. As the research field moves forward, here are a few things we think are important to pursue in the future.

Among different kinds of knowledge, two types of commonsense knowledge are considered fundamental for human reasoning and decision making: intuitive psychology and intuitive physics. Several benchmarks are geared towards intuitive psychology, e.g., Triangle-COPA (?), Story Commonsense (?), and Event2Mind (?). Reasoning with intuitive physics is scattered across different benchmarks such as bAbI (?) and SWAG (?). Understanding how such commonsense knowledge is developed and acquired in humans and how it is related to human language production and comprehension may shed light on computational models for language processing.

In addition to the benchmark tasks in written language form discussed in this paper, it may be worthwhile to explore new tasks that involve artificial agents (in either a simulated world or the real physical world) which can use language to communicate, to perceive, and to act. Examples include interactive task learning (?) and embodied question answering (?). Because commonsense knowledge is so intuitive for humans, it is difficult even for researchers to identify and formalize this kind of knowledge. Working with agents and observing their abilities and limitations in understanding language and grounding language to their own sensorimotor skills will allow researchers to better understand the space of commonsense knowledge and tackle the problem of knowledge acquisition accordingly.

One challenge of the current trend of work is the disconnect between commonsense knowledge resources and the approaches taken to tackle the benchmark tasks. Most approaches, particularly neural approaches, only accrue knowledge or learn models from training data, a method that critics doubt can yield reasoning ability comparable to that of humans (?). Although there exist many knowledge bases designed for commonsense reasoning, very few of them are used directly to solve the benchmark tasks. One likely reason is that these knowledge bases do not cover the kind of knowledge that is required to solve those tasks. This was found to be the case for ConceptNet (?), for example, during the creation of the Event2Mind benchmark (?). To address this problem, several methods have been proposed for leveraging incomplete knowledge bases. One method mentioned in Section 3 is AnalogySpace (?), which uses principal component analysis to make analogies that smooth over missing commonsense axioms. Another example is memory comparison networks (?), which allow machines to generalize over existing temporal relations in knowledge resources in order to acquire new relations. Future work will need to come up with more solutions to handle the long-tail phenomenon (?).
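
The idea of smoothing over missing assertions with a low-rank projection can be illustrated with a minimal sketch. This is not the AnalogySpace implementation: the concepts, features, and the function name `rank_one_approximation` are hypothetical, and a single component found by power iteration stands in for a full principal component analysis over a large concept-by-feature matrix.

```python
import math

# Hypothetical toy knowledge base: rows are concepts, columns are features.
# 1.0 marks a known assertion; 0.0 marks an assertion missing from the resource.
CONCEPTS = ["dog", "cat", "hamster"]
FEATURES = ["IsA pet", "HasA fur", "HasA tail"]
MATRIX = [
    [1.0, 1.0, 1.0],  # dog
    [1.0, 1.0, 1.0],  # cat
    [1.0, 1.0, 0.0],  # hamster: "HasA tail" is missing
]


def rank_one_approximation(matrix, iterations=100):
    """Project `matrix` onto its top right singular vector (power iteration)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # One step of power iteration on M^T M: v <- normalize(M^T (M v)).
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # M v v^T is the best rank-1 approximation of M.
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


smoothed = rank_one_approximation(MATRIX)
row, col = CONCEPTS.index("hamster"), FEATURES.index("HasA tail")
print(f"hamster / HasA tail: before={MATRIX[row][col]:.2f}, after={smoothed[row][col]:.2f}")
```

In this toy matrix the missing hamster entry receives a positive score (roughly 0.58) after projection, because hamsters share their other features with dogs and cats. The same analogy-by-similarity effect is what a low-rank method relies on at scale, although a real system would use many components and far sparser data.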

Another potential avenue for addressing this disconnect is to jointly develop benchmark tasks and construct knowledge bases. Recently, we have seen the creation of knowledge graphs geared toward particular tasks, like the ATOMIC knowledge graph by ? (?), which expands upon data in the Event2Mind benchmark and ideally provides the required relations that ConceptNet does not. We have also seen the creation of benchmarks geared toward particular knowledge graphs, like CommonsenseQA (?), where questions were drawn from subgraphs of ConceptNet, thus encouraging the use of ConceptNet in approaching the benchmark tasks. A tighter coupling of benchmark tasks and knowledge resources will help researchers understand and formalize the scope of knowledge needed, and facilitate the development and evaluation of approaches that can incorporate external knowledge.

As more benchmark tasks become available and performance on these tasks keeps growing, one central question is whether the technologies developed are in fact pushing the state of the art or only learning superficial artifacts from the datasets. Better understanding of the behaviors of these models, especially deep learning models that achieve high performance, is critical. For example, when applied to DNC (?), the multi-task two-way entailment benchmark, InferSent achieves high accuracy on many of the benchmark’s tasks (sometimes over 90%) when trained and tested on only the hypothesis texts, rather than on both the context and hypothesis sentences. Since humans require both the context and the hypothesis to perform textual entailment, this suggests that the model is learning obscure statistical biases in the data rather than performing actual reasoning. As another example, ? (?) propose an adversarial evaluation scheme for SQuAD (?) which randomly inserts distractor sentences that do not change the meaning of the passage, and show that the performance of high-performing models on the benchmark drops significantly. ? (?) also highlights several more cases in which strong model performance has been shown to be spurious, identifying several recent works which show that high-performing modern NLP systems can break down due to small, inconsequential changes in inputs. BERT may suffer from a similar issue: its deep, bidirectional architecture makes its reasoning processes difficult to explain, so it is unclear why the model reaches certain conclusions. It is also unclear whether BERT is capturing semantic phenomena or again learning statistical biases. According to the creators of BERT, this will be a subject of future research.
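
A hypothesis-only diagnostic of the kind described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the InferSent experiment: the toy examples, the word-count voting classifier, and the names `train_hypothesis_only`, `predict`, and `accuracy` are all hypothetical. The point is that the context sentence is never read, yet the classifier can still exploit surface cues such as negation words.

```python
from collections import Counter, defaultdict

# Hypothetical toy entailment data: (context, hypothesis, label).
TRAIN = [
    ("A man plays guitar on stage.", "A man is not playing music.", "contradiction"),
    ("A dog runs through a field.", "The dog is not moving.", "contradiction"),
    ("Two kids eat lunch.", "Nobody is eating.", "contradiction"),
    ("A woman reads a book.", "A person is reading.", "entailment"),
    ("A boy kicks a ball.", "A child is playing.", "entailment"),
    ("A chef cooks pasta.", "Someone is cooking.", "entailment"),
]
TEST = [
    ("A girl sings.", "The girl is not singing.", "contradiction"),
    ("A cat sleeps on a sofa.", "A pet is sleeping.", "entailment"),
]
LABELS = ("contradiction", "entailment")


def tokenize(text):
    return text.lower().replace(".", "").split()


def train_hypothesis_only(examples):
    """Count how often each hypothesis word co-occurs with each label."""
    counts = defaultdict(Counter)
    for _context, hypothesis, label in examples:  # the context is ignored on purpose
        for token in tokenize(hypothesis):
            counts[token][label] += 1
    return counts


def predict(counts, hypothesis):
    """Vote for the label most associated with the hypothesis words."""
    votes = Counter()
    for token in tokenize(hypothesis):
        votes.update(counts.get(token, {}))
    return max(LABELS, key=lambda label: votes[label])


def accuracy(counts, examples):
    correct = sum(predict(counts, hyp) == label for _ctx, hyp, label in examples)
    return correct / len(examples)


model = train_hypothesis_only(TRAIN)
print("hypothesis-only accuracy:", accuracy(model, TEST))
```

If a baseline like this scores well above chance on a real benchmark, the result says more about annotation artifacts in the hypotheses than about any reasoning over the context.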

Lastly, as in most deep learning applications, a significant amount of effort has been spent on tuning parameters to improve performance. We have also recently seen in other AI subfields that more sophisticated models may lead to better performance on a particular benchmark, but simpler models with better parameter tuning may later achieve comparable results. For example, in image classification, a study by ? (?) shows that nearly all improvements of recent deep neural networks over earlier bag-of-features classifiers come from better fine-tuning rather than improvements in decision processes. NLP models may be similarly vulnerable to this. More effort on theoretical understanding and motivation for model design and parameter tuning would be beneficial.

References


  • Agrawal et al. Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C. L., Batra, D., and Parikh, D. (2015). VQA: Visual Question Answering.  In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile. IEEE.
  • Agrawal et al. Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C. L., Parikh, D., and Batra, D. (2017). VQA: Visual Question Answering.  Int. J. Comput. Vision, 123(1), 4–31.
  • Anderson et al. Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., and van den Hengel, A. (2018). Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments.  In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA. IEEE.
  • Andrade et al. Andrade, D., Bai, B., Rajendran, R., and Watanabe, Y. (2018). Leveraging knowledge bases for future prediction with memory comparison networks.  AI Communications.
  • Auer et al. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2007). DBpedia: A Nucleus for a Web of Open Data.  In Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., and Cudré-Mauroux, P. (Eds.), Proceedings of the Semantic Web Challenge 2007 Co-Located with ISWC 2007 + ASWC 2007, Lecture Notes in Computer Science, pp. 722–735, Busan, Korea. Springer Berlin Heidelberg.
  • Bahdanau et al. Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate.  In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).
  • Banarescu et al. Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn, P., Palmer, M., and Schneider, N. (2013). Abstract Meaning Representation for Sembanking.  In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 178–186, Sofia, Bulgaria. Association for Computational Linguistics.
  • Bar-Haim et al. Bar-Haim, R., Dagan, I., Dolan, B., Ferro, L., Giampiccolo, D., Magnini, B., and Szpektor, I. (2006). The Second PASCAL Recognising Textual Entailment Challenge.  In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Venice, Italy.
  • Bentivogli et al. Bentivogli, L., Clark, P., Dagan, I., and Giampiccolo, D. (2010). The Sixth PASCAL Recognizing Textual Entailment Challenge.  In Proceedings of the Third Text Analysis Conference (TAC 2010), Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • Bentivogli et al. Bentivogli, L., Clark, P., Dagan, I., and Giampiccolo, D. (2011). The Seventh PASCAL Recognizing Textual Entailment Challenge.  In Proceedings of the Fourth Text Analysis Conference (TAC 2011), Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • Bentivogli et al. Bentivogli, L., Dagan, I., Dang, H. T., Giampiccolo, D., and Magnini, B. (2009). The Fifth PASCAL Recognizing Textual Entailment Challenge.  In Proceedings of the Second Text Analysis Conference (TAC 2009), p. 15, Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • Bollacker et al. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. (2008). Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge.  In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pp. 1247–1250, New York, NY, USA. ACM.
  • Bos et al. Bos, J., Basile, V., Evang, K., Venhuizen, N. J., and Bjerva, J. (2017). The Groningen Meaning Bank.  In Handbook of Linguistic Annotation, pp. 463–496. Springer.
  • Bowman et al. Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference.  In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pp. 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  • Brendel and Bethge Brendel, W., and Bethge, M. (2019). Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet.  In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA.
  • Cambria et al. Cambria, E., Olsher, D., and Rajagopal, D. (2014a). SenticNet 3: A Common and Common-Sense Knowledge Base for Cognition-Driven Sentiment Analysis.  In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI-14), Québec City, QC, Canada. AAAI Press.
  • Cambria et al. Cambria, E., Song, Y., Wang, H., and Howard, N. (2014b). Semantic Multidimensional Scaling for Open-Domain Sentiment Analysis.  IEEE Intelligent Systems, 29(2), 44–51.
  • Cambria et al. Cambria, E., Song, Y., Wang, H., and Hussain, A. (2011). Isanette: A Common and Common Sense Knowledge Base for Opinion Mining.  In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 315–322, Vancouver, BC, Canada. IEEE.
  • Cambria et al. Cambria, E., Speer, R., Havasi, C., and Hussain, A. (2010). SenticNet: A Publicly Available Semantic Resource for Opinion Mining.  In AAAI Fall Symposium on Commonsense Knowledge, Menlo Park, CA, USA. AAAI Press.
  • Carlson et al. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Jr, E. R. H., and Mitchell, T. M. (2010). Toward an Architecture for Never-Ending Language Learning.  In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), Atlanta, GA, USA. AAAI Press.
  • Cer et al. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. (2017). SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation.  In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14, Vancouver, BC, Canada. Association for Computational Linguistics.
  • Chai et al. Chai, J. Y., Gao, Q., She, L., Yang, S., Saba-Sadiya, S., and Xu, G. (2018). Language to Action: Towards Interactive Task Learning with Physical Agents.  In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), pp. 2–9, Stockholm, Sweden. International Joint Conferences on Artificial Intelligence Organization.
  • Chambers and Jurafsky Chambers, N., and Jurafsky, D. (2008). Unsupervised Learning of Narrative Event Chains.  In Proceedings of ACL-08: HLT, Columbus, OH, USA. Association for Computational Linguistics.
  • Chelba et al. Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., and Robinson, T. (2014). One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.  In 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), Singapore, Singapore. ISCA Archive.
  • Chen et al. Chen, Q., Zhu, X., Ling, Z., Wei, S., Jiang, H., and Inkpen, D. (2017). Enhanced LSTM for Natural Language Inference.  In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, BC, Canada. Association for Computational Linguistics.
  • Chen et al. Chen, Z., Cui, Y., Ma, W., Wang, S., Liu, T., and Hu, G. (2018). HFL-RC System at SemEval-2018 Task 11: Hybrid Multi-Aspects Model for Commonsense Reading Comprehension.  arXiv: 1803.05655.
  • Chklovski Chklovski, T. (2003). Learner: A System for Acquiring Commonsense Knowledge by Analogy.  In Proceedings of the 2nd International Conference on Knowledge Capture (K-CAP ’03), pp. 4–12, New York, NY, USA. ACM.
  • Chklovski and Pantel Chklovski, T., and Pantel, P. (2004). VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations.  In Lin, D., and Wu, D. (Eds.), Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pp. 33–40, Barcelona, Spain. Association for Computational Linguistics.
  • Choi et al. Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., and Zettlemoyer, L. (2018). QuAC: Question Answering in Context.  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics.
  • Clark and Gardner Clark, C., and Gardner, M. (2018). Simple and Effective Multi-Paragraph Reading Comprehension.  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), pp. 845–855, Melbourne, Australia. Association for Computational Linguistics.
  • Clark et al. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. (2018). Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.  arXiv: 1803.05457.
  • Dagan et al. Dagan, I., Glickman, O., and Magnini, B. (2005). The PASCAL Recognising Textual Entailment Challenge.  In Quiñonero-Candela, J., Dagan, I., Magnini, B., and d’Alché-Buc, F. (Eds.), Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, Vol. 3944, pp. 177–190. Springer Berlin Heidelberg, Berlin, Heidelberg.
  • Das et al. Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., and Batra, D. (2018). Embodied Question Answering.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA. IEEE.
  • Das et al. Das, R., Munkhdalai, T., Yuan, X., Trischler, A., and McCallum, A. (2019). Building Dynamic Knowledge Graphs from Text using Machine Reading Comprehension.  In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA.
  • Davis Davis, E. (2017). Logical Formalizations of Commonsense Reasoning: A Survey.  Journal of Artificial Intelligence Research, 59, 651–723.
  • Davis and Marcus Davis, E., and Marcus, G. (2015). Commonsense reasoning and commonsense knowledge in artificial intelligence.  Commun. ACM, 58(9), 92–103.
  • Davis et al. Davis, E., Morgenstern, L., and Ortiz, C. (2018). The Winograd Schema Challenge.
  • Davis et al. Davis, E., Morgenstern, L., and Ortiz, C. L. (2017). The First Winograd Schema Challenge at IJCAI-16.  AI Magazine, 38(3), 97–98.
  • Devlin et al. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.  arXiv: 1810.04805.
  • Dolan and Brockett Dolan, W. B., and Brockett, C. (2005). Automatically Constructing a Corpus of Sentential Paraphrases.  In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), Jeju Island, Korea.
  • Dzikovska et al. Dzikovska, M. O., Nielsen, R. D., Brew, C., Leacock, C., Giampiccolo, D., Bentivogli, L., Clark, P., Dagan, I., and Dang, H. T. (2013). SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge.  In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval-2013), p. 13, Atlanta, GA, USA. Association for Computational Linguistics.
  • Etzioni et al. Etzioni, O., Banko, M., Soderland, S., and Weld, D. S. (2008). Open information extraction from the web.  Communications of the ACM, 51(12), 68.
  • Etzioni et al. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. (2005). Unsupervised named-entity extraction from the Web: An experimental study.  Artificial Intelligence, 165(1), 91–134.
  • Fellbaum Fellbaum, C. (1999). WordNet: An Electronic Lexical Database (2nd printing). Language, Speech, and Communication. A Bradford Book, Cambridge, Mass.
  • Fillmore et al. Fillmore, C. J., Baker, C. F., and Sato, H. (2002). The FrameNet Database and Software Tools.  In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Canary Islands - Spain. European Language Resources Association (ELRA).
  • Fouhey et al. Fouhey, D. F., Kuo, W., Efros, A. A., and Malik, J. (2018). From Lifestyle VLOGs to Everyday Interactions.  In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA. IEEE.
  • Gabrilovich et al. Gabrilovich, E., Ringgaard, M., and Subramanya, A. (2013). FACC1: Freebase annotation of ClueWeb corpora, Version 1 (Release date 2013-06-26, Format version 1, Correction level 0).  http://lemurproject.org/clueweb09/FACC1/.
  • Gao et al. Gao, Q., Doering, M., Yang, S., and Chai, J. (2016). Physical Causality of Action Verbs in Grounded Language Understanding.  In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pp. 1814–1824, Berlin, Germany. Association for Computational Linguistics.
  • Giampiccolo et al. Giampiccolo, D., Dang, H. T., Magnini, B., Dagan, I., and Dolan, B. (2008). The Fourth PASCAL Recognizing Textual Entailment Challenge.  In Proceedings of the First Text Analysis Conference (TAC 2008), p. 9, Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • Giampiccolo et al. Giampiccolo, D., Magnini, B., Dagan, I., and Dolan, B. (2007). The Third PASCAL Recognizing Textual Entailment Challenge.  In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, RTE ’07, pp. 1–9, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Glickman Glickman, O. (2006). Applied Textual Entailment. Ph.D. Thesis, Bar Ilan University.
  • Gordon Gordon, A. S. (2016). Commonsense Interpretation of Triangle Behavior.  In Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA. AAAI Press.
  • Goyal et al. Goyal, R., Kahou, S. E., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., and Memisevic, R. (2017). The "something something" video database for learning and evaluating visual common sense.  In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017).
  • Gururangan et al. Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. (2018). Annotation Artifacts in Natural Language Inference Data.  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), pp. 107–112, New Orleans, LA, USA. Association for Computational Linguistics.
  • Hartshorne et al. Hartshorne, J. K., Bonial, C., and Palmer, M. (2013). The VerbCorner project: Toward an empirically-based semantic decomposition of verbs.  In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pp. 1438–1442, Seattle, WA, USA. Association for Computational Linguistics.
  • Havasi et al. Havasi, C., Speer, R., and Alonso, J. B. (2007). ConceptNet 3: A flexible, multilingual semantic network for common sense knowledge.  In Recent Advances in Natural Language Processing (RANLP-07), Borovets, Bulgaria. Association for Computational Linguistics.
  • Heilbron et al. Heilbron, F. C., Escorcia, V., Ghanem, B., and Niebles, J. C. (2015). ActivityNet: A large-scale video benchmark for human activity understanding.  In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 961–970, Boston, MA, USA. IEEE.
  • Henaff et al. Henaff, M., Weston, J., Szlam, A., Bordes, A., and LeCun, Y. (2017). Tracking the World State with Recurrent Entity Networks.  In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), p. 14, Palais des Congrès Neptune, Toulon, France.
  • Hickl et al. Hickl, A., Bensley, J., Williams, J., Roberts, K., Rink, B., and Shi, Y. (2006). Recognizing textual entailment with LCC’s GROUNDHOG system.  In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Venice, Italy.
  • Hill et al. Hill, F., Bordes, A., Chopra, S., and Weston, J. (2015). The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations.  arXiv: 1511.02301.
  • Hoffart et al. Hoffart, J., Suchanek, F. M., Berberich, K., and Weikum, G. (2012). YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia.  Artificial Intelligence, 194, 28–61.
  • Huang et al. Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. (2016). Densely Connected Convolutional Networks.  In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA. IEEE.
  • Iftene Iftene, A. (2008). UAIC Participation at RTE4.  In Proceedings of the First Text Analysis Conference (TAC 2008), Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • Iyer et al. Iyer, S., Dandekar, N., and Csernai, K. (2017). First Quora Dataset Release: Question Pairs.
  • Jia and Liang Jia, R., and Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension Systems.  In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pp. 2021–2031. Association for Computational Linguistics.
  • Johnson et al. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., and Girshick, R. (2017). CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning.  In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). IEEE.
  • Kafle and Kanan Kafle, K., and Kanan, C. (2017). Visual Question Answering: Datasets, Algorithms, and Future Challenges.  Computer Vision and Image Understanding, 163, 3–20.
  • Khashabi et al. Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., and Roth, D. (2018). Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences.  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), New Orleans, LA, USA. Association for Computational Linguistics.
  • Khot et al. Khot, T., Sabharwal, A., and Clark, P. (2018). SciTail: A Textual Entailment Dataset from Science Question Answering.  In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), p. 9, New Orleans, LA, USA. AAAI Press.
  • Kim et al. Kim, S., Hong, J.-H., Kang, I., and Kwak, N. (2019). Semantic Sentence Matching with Densely-connected Recurrent and Co-attentive Information.  In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, HI, USA. AAAI Press.
  • Kingsbury et al. Kingsbury, P., Palmer, M., and Marcus, M. (2002). Adding Semantic Annotation to the Penn TreeBank.  In Proceedings of the Second International Conference on Human Language Technology Research (HLT ’02), p. 5, San Diego, CA, USA. Morgan Kaufmann Publishers Inc.
  • Kotzias et al. Kotzias, D., Denil, M., De Freitas, N., and Smyth, P. (2015). From group to individual labels using deep features.  In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 597–606, Sydney, Australia. ACM.
  • Lai and Hockenmaier Lai, A., and Hockenmaier, J. (2014). Illinois-LH: A Denotational and Distributional Approach to Semantics.  In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014), pp. 329–334, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.
  • Lee et al. Lee, K., Artzi, Y., Choi, Y., and Zettlemoyer, L. (2015). Event detection and factuality assessment with non-expert supervision.  In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pp. 1643–1648, Lisbon, Portugal. Association for Computational Linguistics.
  • Lenat and Guha Lenat, D. B., and Guha, R. V. (1989). Building Large Knowledge-Based Systems; Representation and Inference in the Cyc Project (1st edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
  • Levesque Levesque, H. J. (2011). The Winograd Schema Challenge.  In AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford, CA, USA. AAAI Press.
  • Levesque et al. Levesque, H. J., Davis, E., and Morgenstern, L. (2012). The Winograd schema challenge.  In Proceedings of the Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning (KR2012), Rome, Italy. AAAI Press.
  • Levin Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago.
  • Li et al. Li, B., Lee-Urban, S., Johnston, G., and Riedl, M. O. (2013). Story Generation with Crowdsourced Plot Graphs.  In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (AAAI-13), pp. 598–604, Bellevue, Washington. AAAI Press.
  • Lin Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries.  In Marie-Francine Moens, S. S. (Ed.), Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Liu and Singh Liu, H., and Singh, P. (2004). ConceptNet — A Practical Commonsense Reasoning Tool-Kit.  BT Technology Journal, 22(4), 211–226.
  • Liu et al. Liu, Q., Jiang, H., Ling, Z.-H., Zhu, X., Wei, S., and Hu, Y. (2017). Combing Context and Commonsense Knowledge Through Neural Networks for Solving Winograd Schema Problems.  In 2017 AAAI Spring Symposium Series.
  • Liu et al. Liu, X., He, P., Chen, W., and Gao, J. (2019). Multi-Task Deep Neural Networks for Natural Language Understanding.  arXiv: 1901.11504.
  • Mahdisoltani et al. Mahdisoltani, F., Biega, J., and Suchanek, F. M. (2013). YAGO3: A Knowledge Base from Multilingual Wikipedias.  In Proceedings of the 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013), Asilomar, CA, USA.
  • Manjunatha et al. Manjunatha, V., Saini, N., and Davis, L. S. (2018). Explicit Bias Discovery in Visual Question Answering Models.  arXiv: 1811.07789.
  • Manning and Hudson Manning, C., and Hudson, D. (2018). Towards real-world visual reasoning.
  • Marasović Marasović, A. (2018). NLP’s generalization problem, and how researchers are tackling it.
  • Marcus et al. Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a Large Annotated Corpus of English: The Penn Treebank.  Computational Linguistics, 19(2).
  • Marelli et al. Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., and Zamparelli, R. (2014a). A SICK cure for the evaluation of compositional distributional semantic models.  In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), Reykjavik, Iceland. European Language Resources Association (ELRA).
  • Marelli et al. Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., and Zamparelli, R. (2014b). SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment.  In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014), Dublin, Ireland. Association for Computational Linguistics.
  • Maslow Maslow, A. H. (1943). A theory of human motivation.  Psychological Review, 50(4), 370–396.
  • Medelyan and Legg Medelyan, O., and Legg, C. (2008). Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense.  In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI-08), Chicago, IL, USA. AAAI Press.
  • Mihaylov et al. Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pp. 2381–2391, Brussels, Belgium. Association for Computational Linguistics.
  • Mikolov et al. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space.  In Proceedings of the 1st International Conference on Learning Representations (ICLR 2013), Scottsdale, AZ, USA.
  • Miller Miller, G. A. (1995). WordNet: A Lexical Database for English.  Commun. ACM, 38(11), 39–41.
  • Miller et al. Miller, T., Hempelmann, C., and Gurevych, I. (2017). SemEval-2017 Task 7: Detection and Interpretation of English Puns.  In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 58–68, Vancouver, Canada. Association for Computational Linguistics.
  • Miltsakaki et al. Miltsakaki, E., Prasad, R., Joshi, A., and Webber, B. (2004). The Penn Discourse Treebank.  In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC-2004), p. 4, Lisbon, Portugal. Evaluation and Language Resources Distribution Agency.
  • Minard et al. Minard, A.-L., Speranza, M., Urizar, R., Altuna, B., van Erp, M., Schoen, A., and van Son, C. (2016). MEANTIME, the NewsReader Multilingual Event and Time Corpus.  In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC-2016), Portorož, Slovenia. European Language Resources Association (ELRA).
  • Mishra et al. Mishra, B. D., Huang, L., Tandon, N., Yih, W.-t., and Clark, P. (2018). Tracking State Changes in Procedural Text: A Challenge Dataset and Models for Process Paragraph Comprehension.  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), New Orleans, LA, USA. Association for Computational Linguistics.
  • Morgenstern Morgenstern, L. (2016). Pronoun Disambiguation Problems.
  • Morgenstern et al. Morgenstern, L., Davis, E., and Ortiz, C. L. (2016). Planning, Executing, and Evaluating the Winograd Schema Challenge.  AI Magazine, 37(1), 50–54.
  • Morgenstern and Ortiz Morgenstern, L., and Ortiz, C. (2015). The Winograd Schema Challenge.  In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), Austin, TX, USA. AAAI Press.
  • Mostafazadeh et al. Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., and Allen, J. (2016). A Corpus and Cloze Evaluation Framework for Deeper Understanding of Commonsense Stories.  In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), San Diego, CA, USA. Association for Computational Linguistics.
  • Negri et al. Negri, M., Marchetti, A., Mehdad, Y., Bentivogli, L., and Giampiccolo, D. (2012). Semeval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization.  In Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval-2012), pp. 399–407, Montréal, Canada. Association for Computational Linguistics.
  • Negri et al. Negri, M., Marchetti, A., Mehdad, Y., Bentivogli, L., and Giampiccolo, D. (2013). Semeval-2013 Task 8: Cross-lingual Textual Entailment for Content Synchronization.  In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval-2013), pp. 25–33, Atlanta, Georgia, USA. Association for Computational Linguistics.
  • Olsher Olsher, D. (2014). Semantically-based priors and nuanced knowledge core for Big Data, Social AI, and language understanding.  Neural Networks, 58, 131–147.
  • Ortiz Ortiz, C. (2016). Why We Need a Physically Embodied Turing Test and What It Might Look Like.  AI Magazine, 37(1), 55–62.
  • Ostermann et al. Ostermann, S., Modi, A., Roth, M., Thater, S., and Pinkal, M. (2018). MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge.  In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  • Papineni et al. Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: A Method for Automatic Evaluation of Machine Translation.  In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pp. 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Paulheim Paulheim, H. (2018). How much is a Triple? Estimating the Cost of Knowledge Graph Creation.  In Proceedings of the 17th International Semantic Web Conference (ISWC 2018), Monterey, CA, USA. Springer.
  • Pennington et al. Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global Vectors for Word Representation.  In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pp. 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Peters et al. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep Contextualized Word Representations.  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), pp. 2227–2237, New Orleans, LA, USA. Association for Computational Linguistics.
  • Plutchik Plutchik, R. (1980). A general psychoevolutionary theory of emotion.  In Theories of Emotion, pp. 3–31. Academic Press.
  • Pohl Pohl, A. (2012). Classifying the Wikipedia Articles into the OpenCyc Taxonomy.  In Proceedings of the Web of Linked Entities Workshop in Conjunction with the 11th International Semantic Web Conference (ISWC 2012), Boston, MA, USA.
  • Poliak et al. Poliak, A., Haldar, A., Rudinger, R., Hu, J. E., Pavlick, E., White, A. S., and Van Durme, B. (2018a). Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation.  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium. Association for Computational Linguistics.
  • Poliak et al. Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., and Van Durme, B. (2018b). Hypothesis Only Baselines in Natural Language Inference.  In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pp. 180–191, New Orleans, LA, USA. Association for Computational Linguistics.
  • Ponzetto and Strube Ponzetto, S. P., and Strube, M. (2007). Deriving a Large Scale Taxonomy from Wikipedia.  In Proceedings of the 22nd National Conference on Artificial Intelligence, AAAI’07, pp. 1440–1445, Vancouver, BC, Canada. AAAI Press.
  • Pradhan et al. Pradhan, S. S., Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2007). OntoNotes: A Unified Relational Semantic Representation.  In International Conference on Semantic Computing (ICSC 2007), pp. 517–526.
  • Radford et al. Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving Language Understanding with Unsupervised Learning.
  • Radford et al. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners.
  • Rahman and Ng Rahman, A., and Ng, V. (2012). Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge.  In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), pp. 777–789, Jeju Island, Korea. Association for Computational Linguistics.
  • Raina et al. Raina, R., Ng, A. Y., and Manning, C. D. (2005). Robust textual inference via learning and abductive reasoning.  In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), pp. 1099–1105, Pittsburgh, PA, USA. AAAI Press.
  • Rajpurkar et al. Rajpurkar, P., Jia, R., and Liang, P. (2018). Know What You Don’t Know: Unanswerable Questions for SQuAD.  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia. Association for Computational Linguistics.
  • Rajpurkar et al. Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text.  In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, TX, USA. Association for Computational Linguistics.
  • Rashkin et al. Rashkin, H., Bosselut, A., Sap, M., Knight, K., and Choi, Y. (2018a). Modeling Naive Psychology of Characters in Simple Commonsense Stories.  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia. Association for Computational Linguistics.
  • Rashkin et al. Rashkin, H., Sap, M., Allaway, E., Smith, N. A., and Choi, Y. (2018b). Event2Mind: Commonsense Inference on Events, Intents, and Reactions.  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia. Association for Computational Linguistics.
  • Reddy et al. Reddy, S., Chen, D., and Manning, C. D. (2018). CoQA: A Conversational Question Answering Challenge.  arXiv: 1808.07042.
  • Reiss Reiss, S. (2004). Multifaceted Nature of Intrinsic Motivation: The Theory of 16 Basic Desires.  Review of General Psychology, 8(3), 179–193.
  • Richardson et al. Richardson, M., Burges, C. J. C., and Renshaw, E. (2013). MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text.  In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), p. 11, Seattle, WA, USA. Association for Computational Linguistics.
  • Rodosthenous and Michael Rodosthenous, C., and Michael, L. (2016). A Hybrid Approach to Commonsense Knowledge Acquisition.  In Proceedings of the 8th European Starting AI Researcher Symposium, The Hague, the Netherlands.
  • Roemmele et al. Roemmele, M., Bejan, C. A., and Gordon, A. S. (2011). Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning.  In AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, p. 6, Stanford, CA, USA.
  • Roemmele and Gordon Roemmele, M., and Gordon, A. (2018). An Encoder-decoder Approach to Predicting Causal Relations in Stories.  In Proceedings of the First Workshop on Storytelling, pp. 50–59, New Orleans, LA, USA. Association for Computational Linguistics.
  • Rohrbach et al. Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., Courville, A., and Schiele, B. (2017). Movie Description.  International Journal of Computer Vision, 123(1), 94–120.
  • Rudinger et al. Rudinger, R., Naradowsky, J., Leonard, B., and Van Durme, B. (2018a). Gender Bias in Coreference Resolution.  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), pp. 8–14, New Orleans, LA, USA. Association for Computational Linguistics.
  • Rudinger et al. Rudinger, R., White, A. S., and Van Durme, B. (2018b). Neural Models of Factuality.  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), pp. 731–744, New Orleans, LA, USA. Association for Computational Linguistics.
  • Sap et al. Sap, M., LeBras, R., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N. A., and Choi, Y. (2019). ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning.  In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, HI, USA. AAAI Press.
  • Schuler Schuler, K. K. (2005). VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. Dissertation, University of Pennsylvania.
  • Schwartz et al. Schwartz, R., Sap, M., Konstas, I., Zilles, L., Choi, Y., and Smith, N. A. (2017). The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task.  In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, BC, Canada. Association for Computational Linguistics.
  • Sharma et al. Sharma, R., Allen, J., Bakhshandeh, O., and Mostafazadeh, N. (2018). Tackling the Story Ending Biases in The Story Cloze Test.  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), pp. 752–757, Melbourne, Australia. Association for Computational Linguistics.
  • Singh Singh, P. (2002). The Public Acquisition of Commonsense Knowledge.  In AAAI Spring Symposium Series, Palo Alto, CA, USA.
  • Socher et al. Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.  In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), p. 12, Seattle, WA, USA. Association for Computational Linguistics.
  • Speer et al. Speer, R., Chin, J., and Havasi, C. (2017). ConceptNet 5.5: An Open Multilingual Graph of General Knowledge.  In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) and the Twenty-Ninth Innovative Applications of Artificial Intelligence Conference (IAAI-17), San Francisco, CA, USA. AAAI Press.
  • Speer et al. Speer, R., Havasi, C., and Lieberman, H. (2008). AnalogySpace: Reducing the Dimensionality of Common Sense Knowledge.  In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI-08), Chicago, IL, USA.
  • Suchanek et al. Suchanek, F. M., Kasneci, G., and Weikum, G. (2007). YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia.  In Proceedings of the 16th International Conference on World Wide Web, p. 10, Banff, AB, Canada. ACM.
  • Suhr et al. Suhr, A., Lewis, M., Yeh, J., and Artzi, Y. (2017). A Corpus of Natural Language for Visual Reasoning.  In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 217–223. Association for Computational Linguistics.
  • Talmor et al. Talmor, A., Herzig, J., Lourie, N., and Berant, J. (2019). CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge.  In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA. Association for Computational Linguistics.
  • Tandon et al. Tandon, N., de Melo, G., Suchanek, F., and Weikum, G. (2014). WebChild: Harvesting and organizing commonsense knowledge from the web.  In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM ’14), pp. 523–532, New York, NY, USA. ACM.
  • Tandon et al. Tandon, N., de Melo, G., and Weikum, G. (2017). WebChild 2.0: Fine-Grained Commonsense Knowledge Distillation.  In Proceedings of ACL 2017, System Demonstrations, pp. 115–120, Vancouver, BC, Canada. Association for Computational Linguistics.
  • Tandon et al. Tandon, N., Mishra, B. D., Grus, J., Yih, W.-t., Bosselut, A., and Clark, P. (2018). Reasoning about Actions and State Changes by Injecting Commonsense Knowledge.  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium. Association for Computational Linguistics.
  • Taylor et al. Taylor, A., Marcus, M., and Santorini, B. (2003). The Penn Treebank: An Overview.  In Treebanks, Vol. 20 of Text, Speech and Language Technology. Springer, Dordrecht.
  • Taylor Taylor, W. L. (1953). “Cloze Procedure”: A New Tool for Measuring Readability.  Journalism Bulletin, 30(4), 415–433.
  • Tjong Kim Sang and De Meulder Tjong Kim Sang, E. F., and De Meulder, F. (2003). Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition.  In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, CONLL ’03, pp. 142–147, Edmonton, AB, Canada. Association for Computational Linguistics.
  • Trinh and Le Trinh, T. H., and Le, Q. V. (2018). A Simple Method for Commonsense Reasoning.  arXiv: 1806.02847.
  • Tsuchida and Ishikawa Tsuchida, M., and Ishikawa, K. (2011). IKOMA at TAC2011: A Method for Recognizing Textual Entailment using Lexical-level and Sentence Structure-level features.  In Proceedings of the Fourth Text Analysis Conference (TAC 2011), Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • Vaswani et al. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is All you Need.  In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc.
  • Wang et al. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.  In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium. Association for Computational Linguistics.
  • Warstadt et al. Warstadt, A., Singh, A., and Bowman, S. R. (2018). Neural Network Acceptability Judgments.  arXiv: 1805.12471.
  • Weston et al. Weston, J., Bordes, A., Chopra, S., Rush, A. M., van Merriënboer, B., Joulin, A., and Mikolov, T. (2016). Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks.  In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico.
  • Weston et al. Weston, J., Chopra, S., and Bordes, A. (2015). Memory Networks.  In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
  • White et al. White, A. S., Rastogi, P., Duh, K., and Van Durme, B. (2017). Inference is Everything: Recasting Semantic Resources into a Unified Evaluation Framework.  In Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017), pp. 996–1005, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • White and Rawlins White, A. S., and Rawlins, K. (2018). The role of veridicality and factivity in clause selection.  In Proceedings of the 48th Annual Meeting of the North East Linguistic Society, Amherst, MA, USA. GLSA Publications.
  • Williams et al. Williams, A., Nangia, N., and Bowman, S. R. (2017). A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference.  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), New Orleans, LA, USA. Association for Computational Linguistics.
  • Winograd Winograd, T. (1972). Understanding Natural Language. Academic Press, New York.
  • Wu et al. Wu, W., Li, H., Wang, H., and Zhu, K. Q. (2011). Towards a Probabilistic Taxonomy of Many Concepts.
  • Xie et al. Xie, Q., Lai, G., Dai, Z., and Hovy, E. (2017). Large-scale Cloze Test Dataset Created by Teachers.  In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark. Association for Computational Linguistics.
  • Xu et al. Xu, F. F., Lin, B. Y., and Zhu, K. (2018a). Automatic Extraction of Commonsense LocatedNear Knowledge.  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), pp. 96–101, Melbourne, Australia. Association for Computational Linguistics.
  • Xu et al. Xu, J., Zhou, H., Young, T., Zhao, H., Huang, M., and Zhu, X. (2018b). Commonsense Knowledge Aware Conversation Generation with Graph Attention.  In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), pp. 4623–4629, Stockholm, Sweden. IJCAI.
  • Yang et al. Yang, D., Lavie, A., Dyer, C., and Hovy, E. (2015). Humor Recognition and Humor Anchor Extraction.  In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2367–2376, Lisbon, Portugal. Association for Computational Linguistics.
  • Yang et al. Yang, S., Gao, Q., Sadiya, S., and Chai, J. (2018). Commonsense Justification for Action Explanation.  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pp. 2627–2637, Brussels, Belgium. Association for Computational Linguistics.
  • Zellers et al. Zellers, R., Bisk, Y., Farhadi, A., and Choi, Y. (2019). From Recognition to Cognition: Visual Commonsense Reasoning.  In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA. IEEE.
  • Zellers et al. Zellers, R., Bisk, Y., Schwartz, R., and Choi, Y. (2018). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference.  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium. Association for Computational Linguistics.
  • Zhang et al. Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K., and Van Durme, B. (2018). ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension.  arXiv: 1810.12885.
  • Zhang et al. Zhang, S., Rudinger, R., Duh, K., and Van Durme, B. (2016). Ordinal Common-sense Inference.  Transactions of the Association for Computational Linguistics, 5, 379–395.
  • Zhu et al. Zhu, Y., Groth, O., Bernstein, M., and Fei-Fei, L. (2016). Visual7W: Grounded Question Answering in Images.  In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA. IEEE.