Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models

  • 2019-11-12 03:25:38
  • Zhi-Xiu Ye, Qian Chen, Wen Wang, Zhen-Hua Ling

Abstract

Neural language representation models such as Bidirectional Encoder Representations from Transformers (BERT) pre-trained on large-scale corpora can well capture rich semantics from plain text, and can be fine-tuned to consistently improve the performance on various natural language processing (NLP) tasks. However, the existing pre-trained language representation models rarely consider explicitly incorporating commonsense knowledge or other knowledge. In this paper, we develop a pre-training approach for incorporating commonsense knowledge into language representation models. We construct a commonsense-related multi-choice question answering dataset for pre-training a neural language representation model. The dataset is created automatically by our proposed "align, mask, and select" (AMS) method. We also investigate different pre-training tasks. Experimental results demonstrate that pre-training models using the proposed approach followed by fine-tuning achieves significant improvements on various commonsense-related tasks, such as CommonsenseQA and Winograd Schema Challenge, while maintaining comparable performance on other NLP tasks, such as sentence classification and natural language inference (NLI) tasks, compared to the original BERT models.

 


Align, Mask and Select: A Simple Method for Incorporating
Commonsense Knowledge into Language Representation Models

Zhi-Xiu Ye,1 Qian Chen,2 Wen Wang,2 Zhen-Hua Ling1
1National Engineering Laboratory for Speech and Language Information Processing,
University of Science and Technology of China
2Speech Lab, DAMO Academy, Alibaba Group
Work was done during an internship at DAMO Academy, Alibaba Group.
Abstract

Neural language representation models such as Bidirectional Encoder Representations from Transformers (BERT) can well capture rich language information from unlabelled text, and can be fine-tuned to benefit many NLP applications. However, the existing pre-trained language representation models rarely consider explicitly incorporating commonsense knowledge or other knowledge. We propose a pre-training approach for incorporating commonsense knowledge into language representation models. We construct a commonsense-related multi-choice question answering dataset for pre-training a neural language representation model. The dataset is created automatically by our proposed “align, mask, and select” (AMS) method. We also investigate different pre-training tasks. Experimental results demonstrate that pre-training models using the proposed approach followed by fine-tuning achieve significant improvements over previous state-of-the-art models on various commonsense-related tasks, such as CommonsenseQA and Winograd Schema Challenge. We also find that fine-tuned models after the proposed pre-training approach maintain comparable performance on other NLP tasks, such as sentence classification and natural language inference tasks, compared to the original BERT models. These results verify that the proposed approach, while significantly improving commonsense-related NLP tasks, does not degrade the language representation capabilities.


1 Introduction

Neural language representation models, such as (??????), can well capture rich semantics from text and consistently improve the performance of various downstream natural language processing (NLP) tasks. Bidirectional Encoder Representations from Transformers (BERT) (?), as one of the most recently developed models, has produced the state-of-the-art (SOTA) results by simple fine-tuning on various NLP tasks, including natural language inference (NLI) (?), named entity recognition (NER) (?), question answering (QA) (??), text classification (?), and has achieved human-level performances on several datasets (??).

However, commonsense reasoning remains a challenging task for modern machine learning methods. For example, recently (?) proposed a commonsense-related task, CommonsenseQA, and showed that the BERT model accuracy remains dozens of points lower than the human accuracy on questions about commonsense knowledge. Some examples from CommonsenseQA are shown in Part A of Table 1. As can be seen from the examples, although it is easy for humans to answer the questions based on their knowledge about the world, it is a great challenge for machines when there is limited training data.

A) Some examples from CommonsenseQA dataset
What can eating lunch cause that is painful?
headache, gain weight, farts, bad breath, heartburn
What is the main purpose of having a bath?
cleanness, water, exfoliation, hygiene, wetness
Where could you find a shark before it was caught?
business, marine museum, pool hall, tomales bay, desert
B) Some triples from ConceptNet
(eating dinner, Causes, heartburn)
(eating dinner, MotivatedByGoal, not get headache)
(lunch, Synonym, dinner)
(have bath, HasSubevent, cleaning)
(shark, AtLocation, tomales bay)
Table 1: Some examples from the CommonsenseQA dataset shown in Part A and some related triples from ConceptNet shown in Part B. The correct answers in Part A are in boldface.

We hypothesize that exploiting knowledge graphs (KGs) that represent commonsense in QA modeling may help the model choose correct answers. For example, as shown in Part B of Table 1, some triples from ConceptNet (?) are closely related to the questions above. Exploiting these triples in QA modeling may help QA models make the correct decision.

In this paper, we propose a pre-training approach that can leverage commonsense KGs, such as ConceptNet (?), to improve the commonsense reasoning capability of language representation models, such as BERT, without sacrificing the language representation capabilities of the models. That is, we also aim to maintain comparable performance on other NLP tasks with the original BERT models. Note that it is challenging to incorporate commonsense knowledge into language representation models since commonsense knowledge is usually represented in a structured format, such as (concept1, relation, concept2) in ConceptNet, which is inconsistent with the data used for pre-training language representation models. For example, BERT is pre-trained on the BooksCorpus and English Wikipedia, which are composed of unstructured natural language sentences.

To tackle this challenge, inspired by the distant supervision approach (?), we propose an “align, mask and select” (AMS) method that aligns a commonsense KG with a large text corpus to construct a dataset consisting of sentences with labeled concepts. Unlike the masked language model (MLM) and next sentence prediction (NSP) tasks used for pre-training BERT models, we construct a multi-choice question answering task based on the generated dataset. We then pre-train the BERT model on this dataset with the multi-choice question answering task and fine-tune it on commonsense-related tasks, such as CommonsenseQA (?) and Winograd Schema Challenge (WSC) (?). We observe significant improvements over previous SOTA results. We also fine-tune and evaluate the pre-trained models on other NLP tasks, such as sentence classification and NLI tasks in GLUE (?), and achieve comparable performance with the original BERT models.

In summary, our contributions are threefold. First, we propose a pre-training approach for incorporating commonsense knowledge into language representation models to improve the commonsense reasoning capabilities of these models. The approach is applicable to BERT and other language representation models that follow the fine-tuning paradigm, and to various KGs, including commonsense KGs and domain-specific KGs. Second, we propose an “align, mask and select” (AMS) method, inspired by the distant supervision approaches, to automatically construct a multi-choice question answering dataset and facilitate the proposed pre-training approach. Third, experiments demonstrate that the pre-trained models from the proposed approach with fine-tuning achieve significant improvement over previous SOTA models on several commonsense-related tasks, such as CommonsenseQA and Winograd Schema Challenge, and still maintain comparable performance on several sentence classification and NLI tasks on the GLUE dataset, demonstrating that the proposed pre-training approach does not degrade the language representation capabilities of the models.

2 Related Work

2.1 Language Representation Model

Language representation models have demonstrated their effectiveness for improving many NLP tasks. These approaches can be categorized into feature-based approaches and fine-tuning approaches. The early feature-based approaches include Word2Vec (?) and GloVe (?), which transform words into distributed representations. However, these methods assign each word a single context-independent vector and therefore cannot disambiguate word senses. (?) further proposed Embeddings from Language Models (ELMo), which derive context-aware word vectors from a bidirectional LSTM trained with a coupled language model (LM) objective on a large text corpus.

The above-mentioned feature-based approaches only use the pre-trained language representations as input features for other models. In contrast, the advantage of the fine-tuning approaches is that few parameters need to be learned from scratch, after pre-training. (?) pre-trained sentence encoders from unlabeled text and fine-tuned for a supervised downstream task. (?) developed a generative pre-trained Transformer (GPT) to learn a generative language model. (?) proposed bidirectional encoder representations from transformers (BERT), which achieved the SOTA performance on a wide variety of NLP tasks. Neither feature-based nor fine-tuning language representation models have incorporated the commonsense knowledge. In this paper, we focus on incorporating commonsense knowledge into the pre-training stage of the fine-tuning approaches.

2.2 Commonsense Reasoning

Commonsense reasoning is a challenging task for modern machine learning methods. As demonstrated in recent work (?), ensembling commonsense knowledge based models with question answering models improved the commonsense reasoning ability. An alternative direction is to directly incorporate commonsense knowledge into a unified language representation model. (?) proposed to directly pre-train BERT on commonsense knowledge triples. For any triple (concept1, relation, concept2), they took the concatenation of concept1 and relation as the question and concept2 as the correct answer. Distractors were formed by randomly picking words or phrases in ConceptNet. In this work, we also investigate directly incorporating commonsense knowledge into a unified language representation model. However, we hypothesize that the language representations learned in (?) may be degraded, since the inputs to the model constructed this way are not natural language sentences. To address this issue, we propose a pre-training approach for incorporating commonsense knowledge that includes a method to construct large-scale natural language sentences. (?) collected the Common Sense Explanations (CoS-E) dataset using Amazon Mechanical Turk and applied a Commonsense Auto-Generated Explanations (CAGE) framework to language representation models, such as GPT and BERT. However, collecting this dataset required a large amount of human effort. In contrast, we propose an “align, mask and select” (AMS) method, inspired by the distant supervision approaches, to automatically construct a multi-choice question answering dataset.

2.3 Distant Supervision

The distant supervision approach was originally proposed for generating training data for the relation classification task. The distant supervision approach (?) assumes that if two entities/concepts participate in a relation, all sentences that mention these two entities/concepts express that relation. Note that noise is inevitable in data labeled by distant supervision (?). In this paper, instead of employing the relation labels produced by distant supervision, we focus on the aligned entities/concepts. We propose the AMS method to construct a multi-choice QA dataset: it aligns sentences with commonsense knowledge triples, masks the aligned words (entities/concepts) in the sentences and treats the masked sentences as questions, and selects several entities/concepts from the knowledge graph as distractor choices.

3 Commonsense Knowledge Base

We use ConceptNet (?) (https://github.com/commonsense/conceptnet5/wiki), one of the most widely used commonsense knowledge bases. ConceptNet is a semantic network that represents large sets of words and phrases and the commonsense relationships between them (36 core relations). It contains over 21 million edges and over 8 million nodes. Its English vocabulary contains approximately 1,500,000 nodes, and it contains at least 10,000 nodes for each of 83 languages.

Each instance in ConceptNet can be represented as a triple ri = (concept1, relation, concept2), indicating the relation between the two concepts concept1 and concept2. For example, the triple (semicarbazide, IsA, chemical compound) means that “semicarbazide is a kind of chemical compound”; the triple (cooking dinner, Causes, cooked food) means that “the effect of cooking dinner is cooked food”, etc.

(1) A triple from ConceptNet
    (population, AtLocation, city)
(2) Align with the English Wikipedia dataset to obtain a sentence containing “population” and “city”
    The largest city by population is Birmingham, which has long been the most industrialized city.
(3) Mask “city” with a special token “[QW]”
    The largest [QW] by population is Birmingham, which has long been the most industrialized city?
(4) Select distractors by searching (population, AtLocation, *) in ConceptNet
    (population, AtLocation, Michigan)
    (population, AtLocation, Petrie dish)
    (population, AtLocation, area with people inhabiting)
    (population, AtLocation, country)
(5) Generate a multi-choice question answering sample
    question: The largest [QW] by population is Birmingham, which has long been the most industrialized city?
    candidates: city (correct answer), Michigan, Petrie dish, area with people inhabiting, country
Table 2: The detailed procedure of constructing a multi-choice question answering sample with the proposed AMS method. The * in the fourth step is a wildcard character. The correct answer for the question is marked “(correct answer)”.

4 Constructing Pre-training Dataset

We first filter the triples in ConceptNet as follows: (1) Remove triples in which either concept is not an English word or phrase. (2) Remove triples with the general relations “RelatedTo” and “IsA”, which account for a large proportion of ConceptNet. (3) Remove triples in which either concept has more than four words or the edit distance between the two concepts is less than four. After filtering, we obtain 606,564 triples.
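The three filtering rules can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the helper names (`is_english`, `keep_triple`, `filter_triples`) are hypothetical, the English check is a crude ASCII heuristic standing in for whatever language test was actually used, and the edit distance is assumed to be character-level Levenshtein distance, which the text does not specify.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (concept1, relation, concept2)

# Relations named in the text as too general to keep.
GENERAL_RELATIONS = {"RelatedTo", "IsA"}


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (assumed; the text does not say which variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def is_english(concept: str) -> bool:
    """Hypothetical stand-in for a language check: ASCII letters, spaces, hyphens, apostrophes."""
    return bool(concept) and all(
        ch.isascii() and (ch.isalpha() or ch in " -'") for ch in concept
    )


def keep_triple(triple: Triple) -> bool:
    c1, rel, c2 = triple
    if not (is_english(c1) and is_english(c2)):        # rule (1)
        return False
    if rel in GENERAL_RELATIONS:                       # rule (2)
        return False
    if len(c1.split()) > 4 or len(c2.split()) > 4:     # rule (3), word count
        return False
    if edit_distance(c1, c2) < 4:                      # rule (3), near-duplicates
        return False
    return True


def filter_triples(triples: Iterable[Triple]) -> List[Triple]:
    return [t for t in triples if keep_triple(t)]
```

Under these assumptions, `(population, AtLocation, city)` is kept, while an `IsA` triple or a pair of concepts that differ by only one or two characters is dropped.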

Each training sample is generated by three steps: align, mask and select, which is denoted the AMS method. Each sample in the dataset consists of a question and five candidate answers, which has the same form as the CommonsenseQA dataset. An example of constructing one training sample by masking concept2 is shown in Table 2.

Firstly, we align each triple (concept1, relation, concept2) in the filtered triple set to the English Wikipedia dataset to extract the sentences containing the two concepts. Secondly, we mask concept1 or concept2 in one sentence with a special token [QW] and treat this sentence as a question, where QW stands in for question words such as “what” and “where”. The masked concept1 or concept2 is the correct answer for this question. Thirdly, we generate the distractors. (?) formed distractors by randomly picking words or phrases in ConceptNet. In our work, in order to generate more confusing distractors than the random selection approach, we select distractors that share the same unmasked concept, i.e., concept2 or concept1, and the same relation with the correct answer. That is to say, we search (*, relation, concept2) or (concept1, relation, *) in ConceptNet to select the distractors, where * is a wildcard character that can match any word or phrase. For each question, we keep four distractors and one correct answer. If there are fewer than four matched distractors, we discard this question instead of complementing it with random selection. If there are more than four distractors, we randomly select four of them. After applying the AMS method, we create 16,324,846 multi-choice question answering samples.
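The align, mask and select steps can be sketched for the case where concept2 is masked. This is a minimal sketch under stated assumptions, not the authors' implementation: `build_index` and `make_sample` are hypothetical names, alignment is reduced to plain substring matching, only the first occurrence of the concept is masked, and the final period is turned into a question mark to mirror the example in Table 2.

```python
import random
from collections import defaultdict

QW = "[QW]"


def build_index(triples):
    """Map (concept1, relation) to all concept2 values, i.e. the (concept1, relation, *) query."""
    index = defaultdict(set)
    for c1, rel, c2 in triples:
        index[(c1, rel)].add(c2)
    return index


def make_sample(sentence, triple, index, rng, num_distractors=4):
    """Build one multi-choice sample by masking concept2, or return None if it is discarded."""
    c1, rel, c2 = triple

    # Align: the sentence must mention both concepts (substring match is an assumption).
    if c1 not in sentence or c2 not in sentence:
        return None

    # Mask: replace the first occurrence of concept2 with [QW].
    question = sentence.replace(c2, QW, 1)
    if question.endswith("."):
        question = question[:-1] + "?"  # mirrors the question form shown in Table 2

    # Select: distractors share concept1 and the relation with the correct answer.
    pool = sorted(index.get((c1, rel), set()) - {c2})
    if len(pool) < num_distractors:
        return None  # discard instead of padding with random concepts
    distractors = rng.sample(pool, num_distractors)

    candidates = distractors + [c2]
    rng.shuffle(candidates)
    return {"question": question, "candidates": candidates, "answer": c2}


triples = [
    ("population", "AtLocation", "city"),
    ("population", "AtLocation", "Michigan"),
    ("population", "AtLocation", "Petrie dish"),
    ("population", "AtLocation", "area with people inhabiting"),
    ("population", "AtLocation", "country"),
]
sentence = ("The largest city by population is Birmingham, "
            "which has long been the most industrialized city.")
sample = make_sample(sentence, triples[0], build_index(triples), random.Random(0))
print(sample["question"])
print(sample["candidates"])
```

Masking concept1 instead would use the symmetric index from (relation, concept2) to concept1.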

5 Pre-training BERT_CS

We investigate a multi-choice QA task for pre-training the English BERT base and BERT large models released by Google (https://github.com/google-research/bert) on our constructed dataset. The resulting models are denoted BERT_CSbase and BERT_CSlarge, respectively. We then evaluate the performance of fine-tuning the BERT_CS models on several NLP tasks (Section 6).

To reduce the large cost of training BERT_CS models from scratch, we initialize the BERT_CS models (for both BERTbase and BERTlarge models) with the parameter weights released by Google. We concatenate the question with each candidate to construct a standard input sequence for BERT_CS (i.e., “[CLS] the largest [QW] by … ? [SEP] city [SEP]”, where [CLS] and [SEP] are two special tokens), and the hidden representations over the [CLS] token are run through a softmax layer to predict whether the candidate is the correct answer.
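As an illustration of the input format, the sketch below writes out one sequence per candidate as a plain string. The function name `build_inputs` is hypothetical, and in practice a BERT tokenizer would insert [CLS] and [SEP] itself; the strings here only make the layout explicit.

```python
from typing import List


def build_inputs(question: str, candidates: List[str]) -> List[str]:
    """Return one '[CLS] question [SEP] candidate [SEP]' string per candidate."""
    return [f"[CLS] {question} [SEP] {candidate} [SEP]" for candidate in candidates]


question = ("The largest [QW] by population is Birmingham, "
            "which has long been the most industrialized city?")
candidates = ["city", "Michigan", "Petrie dish", "area with people inhabiting", "country"]

for sequence in build_inputs(question, candidates):
    print(sequence)
```

Each of the five sequences is encoded separately, and the five [CLS] representations are then scored against each other.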

Dataset          Train   Develop   Test   Candidates
CommonsenseQA    9741    1221      1140   5
WSC Task         1322    564       273    2
Table 3: The statistics of CommonsenseQA and Winograd Schema Challenge datasets.

Model            Accuracy
BERTbase         53.0
BERTlarge        56.7
CoS-E (?)        58.2
BERT_CSbase      56.2
BERT_CSlarge     62.2
Table 4: Accuracy (%) of different models on the CommonsenseQA test set.

The objective function is defined as follows:

L = -\log p(c_i \mid s),  (1)
p(c_i \mid s) = \frac{\exp(\mathbf{w}^{T}\mathbf{c}_i)}{\sum_{k=1}^{N} \exp(\mathbf{w}^{T}\mathbf{c}_k)},  (2)

where ci is the correct answer, 𝐰 the parameters in the softmax layer, N the total number of candidates, and 𝐜i the vector representation of the special token [CLS]. We pre-train BERT_CS models with the batch size 160, the initial learning rate 2e-5, and the max sequence length 128 for 1 epoch. Pre-training is conducted on 16 NVIDIA V100 GPU cards with 32G memory for about 3 days for the BERT_CSlarge model and 1 day for the BERT_CSbase model.
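Equations 1 and 2 amount to a softmax cross-entropy over the N candidate scores. The sketch below works directly on the scalar scores w^T c_k (assumed to have been computed already by the encoder and the softmax layer); the function names are hypothetical, and the max-subtraction is a standard numerical-stability step rather than something stated in the text.

```python
import math
from typing import List


def candidate_log_probs(scores: List[float]) -> List[float]:
    """Log of Equation 2: log-softmax over the candidate scores w^T c_k."""
    m = max(scores)  # subtract the max before exponentiating for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def mcqa_loss(scores: List[float], correct_index: int) -> float:
    """Equation 1: negative log-probability of the correct candidate."""
    return -candidate_log_probs(scores)[correct_index]


# With five equally scored candidates the model is at chance, so the loss is log(5).
print(mcqa_loss([0.0, 0.0, 0.0, 0.0, 0.0], correct_index=0))
```

The loss falls toward zero as the score of the correct candidate grows relative to the others.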

Model                   WSC    non-assoc.   assoc.   unswitched   switched   consist.   WNLI
LM ensemble             63.7   60.6         83.8     63.4         53.4       44.3       -
Knowledge Hunter        57.1   58.3         50.0     58.8         58.8       90.1       -
BERTlarge + MTP         70.3   70.8         67.6     73.3         70.1       59.5       70.5
(?)                     71.1   69.5         81.1     74.1         72.5       66.4       -
(?)                     72.2   71.6         75.7     74.8         72.5       61.1       71.9
BERTlarge + MCQA        71.4   69.9         81.1     71.8         64.9       82.4       78.5
BERT_CSlarge + MCQA     75.5   73.7         86.5     74.8         73.3       86.3       83.6
Table 5: Accuracy (%) of different models on the Winograd Schema Challenge dataset together with its subsets and the WNLI test set. MTP denotes masked token prediction, which is employed in (?). MCQA denotes the multi-choice question-answering format, which is employed in this paper.

Model             MNLI-(m/mm)   QQP    QNLI   SST-2   CoLA   STS-B   MRPC   RTE
BERTbase          84.6/83.4     71.2   90.5   93.5    52.1   85.8    88.9   66.4
BERT_triplebase   83.8/82.6     70.5   89.9   92.9    49.6   85.3    88.7   65.1
BERT_CSbase       84.7/83.9     72.1   91.2   93.6    54.3   86.4    85.9   69.5
BERTlarge         86.7/85.9     72.1   92.7   94.9    60.5   86.5    89.3   70.1
BERT_CSlarge      86.7/85.8     72.1   92.6   94.1    60.7   86.6    89.0   70.7
Table 6: Results (%) of different models on the GLUE test sets. We report Matthews corr. on CoLA, Spearman corr. on STS-B, accuracy on MNLI, QNLI, SST-2 and RTE, F1-score on QQP and MRPC, which is the same as (?).

6 Experiments

In the first set of experiments, we fine-tune the BERT_CS models on several commonsense-related NLP tasks to investigate whether the proposed approach improves the commonsense reasoning capability. In the second set of experiments, we evaluate the BERT_CS models on other NLP tasks, such as text classification and natural language inference (NLI), to investigate whether the proposed approach maintains the general language representation capabilities.

Note that when fine-tuning on the commonsense-related, multi-choice QA tasks, e.g., CommonsenseQA and Winograd Schema Challenge, we fine-tune all parameters in BERT_CS, including the last softmax layer over the [CLS] token; whereas for the second set of experiments, we randomly initialize the classifier layer and train it from scratch. Additionally, as described in (?), fine-tuning BERT is sometimes observed to be unstable on small datasets. Hence we run experiments with 5 different random seeds and select the best model based on the development set for all of the fine-tuning experiments in this section.

6.1 CommonsenseQA

The CommonsenseQA dataset consists of 12,247 questions with one correct answer and four distractor answers. This dataset consists of two splits, the question token split and the random split. Our experiments are conducted on the more challenging random split, which is the main evaluation split according to (?). The statistics of the CommonsenseQA dataset are summarized in Table 3.

Same as the pre-training stage, the input data for fine-tuning the BERT_CS models is formed by concatenating each question-answer pair as a sequence. The hidden representations over the [CLS] token are run through a softmax layer to create the predictions. The objective function is the same as Equations 1 and 2. We fine-tune the BERT_CS models on CommonsenseQA for 2 epochs with a learning rate of 1e-5 and a batch size of 16.

Table 4 shows the accuracy on the CommonsenseQA test set from the baseline BERT models, the previous SOTA model CoS-E (?), and our BERT_CS models. The CoS-E model requires a large amount of human effort to collect the Common Sense Explanations (CoS-E) dataset. In comparison, our multi-choice question-answering dataset is constructed automatically. The BERT_CS models significantly outperform the baseline BERT models, with BERT_CSlarge achieving a 5.5% absolute gain over the baseline BERTlarge model (62.2 versus 56.7) and a 4.0% absolute gain over the previous SOTA CoS-E model (62.2 versus 58.2).

6.2 Winograd Schema Challenge

The Winograd Schema Challenge (WSC) (?) was introduced to test AI agents for commonsense knowledge. The WSC consists of 273 instances of the pronoun disambiguation problem. For example, for the sentence “The delivery truck zoomed by the school bus because it was going so fast.” and the corresponding question “What does the word it refer to?”, the machine is expected to answer “delivery truck” instead of “school bus”. On this task, we follow (?) and employ the WSCR dataset (?) as extra training data. The WSCR dataset is split into a training set of 1,322 examples and a test set of 564 examples. We use the training set for fine-tuning and the test set for validating the BERT_CS models, and test the fine-tuned BERT_CS models on the WSC dataset.

We transform the pronoun disambiguation problem into a multi-choice question answering problem. We mask the pronoun word with a special token [QW] to construct a question, and put the two candidate phrases as candidate answers. The remaining procedures are the same as the CommonsenseQA task. We use the same loss function as (?), that is, if c1 is correct and c2 is not, the loss is

L = -\log p(c_1 \mid s) + \alpha \max\bigl(0, \log p(c_2 \mid s) - \log p(c_1 \mid s) + \beta\bigr),  (3)

where p(c1|s) follows Equation 2 with N=2, and α and β are hyper-parameters. Similar to (?), we search α ∈ {2.5, 5, 10, 20} and β ∈ {0.05, 0.1, 0.2, 0.4} by optimizing the accuracy on the WSCR test set (i.e., the development set for the WSC dataset). We set the batch size to 16 and the learning rate to 1e-5. We evaluate our models on the WSC dataset and its various partitions described in (?). We also evaluate the fine-tuned BERT_CS model (without using the WNLI training data for further fine-tuning) on the WNLI test set, one of the GLUE tasks (?). We first transform the examples in WNLI from the premise-hypothesis format into the pronoun disambiguation problem format (?) and then transform them into the multi-choice QA format.
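Equation 3 adds a hinge term to the log-likelihood loss so that the correct candidate must beat the wrong one by a margin β in log-probability. The sketch below evaluates it for two scalar candidate scores and enumerates the 16-point hyper-parameter grid given above. The function names are hypothetical, and the scores stand in for w^T c_1 and w^T c_2.

```python
import math
from itertools import product


def two_way_log_probs(score_correct: float, score_wrong: float):
    """Equation 2 with N=2, returned as (log p(c1|s), log p(c2|s))."""
    m = max(score_correct, score_wrong)
    log_z = m + math.log(math.exp(score_correct - m) + math.exp(score_wrong - m))
    return score_correct - log_z, score_wrong - log_z


def wsc_loss(score_correct: float, score_wrong: float, alpha: float, beta: float) -> float:
    """Equation 3: negative log-likelihood plus a margin penalty weighted by alpha."""
    log_p1, log_p2 = two_way_log_probs(score_correct, score_wrong)
    return -log_p1 + alpha * max(0.0, log_p2 - log_p1 + beta)


# Grid of (alpha, beta) pairs searched on the development set.
grid = list(product([2.5, 5, 10, 20], [0.05, 0.1, 0.2, 0.4]))

# Tied scores: both log-probabilities equal -log(2), so the hinge contributes alpha * beta.
print(wsc_loss(0.0, 0.0, alpha=5, beta=0.2))
```

Once the correct candidate leads by more than β in log-probability, the hinge term vanishes and only the usual negative log-likelihood remains.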

No.   Model            Source Data                Tasks      Accuracy
1     BERT             -                          -          58.2
2     BERT_triple      ConceptNet                 MCQA       59.1
3     BERT_CS_random   Wikipedia and ConceptNet   MCQA       59.4
4     BERT_CS_MLM      Wikipedia and ConceptNet   MCQA+MLM   59.9
5     BERT_MLM         Wikipedia and ConceptNet   MLM        58.8
6     BERT_CS          Wikipedia and ConceptNet   MCQA       60.8
Table 7: Accuracy (%) of different models on the CommonsenseQA development set. The source data and tasks are those employed to pre-train each model. MCQA denotes the multi-choice question answering task and MLM denotes the masked language modeling task.

The results on the WSC dataset and its various partitions and the WNLI test set are shown in Table 5. Note that the results for (?) are fine-tuned on the whole WSCR dataset, including its training and test set partitions. Results for LM ensemble (?) and Knowledge Hunter (?) are cited from (?). Results for “BERTlarge + MTP” are taken from (?) as the baseline of applying BERT to the WSC task. As can be seen from Table 5, “BERTlarge + MCQA” achieves better performance than “BERTlarge + MTP” on four of the seven evaluation criteria and achieves significant improvements on the assoc. and consist. partitions, which suggests that MCQA is a better problem formatting method than MTP for the WSC task. Also, our “BERT_CSlarge + MCQA” achieves the best performance on all of the evaluation criteria except consist. (where Knowledge Hunter achieves better performance by a rule-based method (?)), and achieves a 3.3% absolute improvement on the WSC dataset over the previous SOTA results from (?).

6.3 GLUE

The General Language Understanding Evaluation (GLUE) benchmark (?) is a collection of diverse natural language understanding tasks, including single-sentence tasks CoLA and SST-2, similarity and paraphrasing tasks MRPC, STS-B and QQP, and natural language inference tasks MNLI, QNLI, RTE and WNLI.

To investigate whether our multi-choice QA based pre-training approach can maintain the performance on common text classification tasks, we evaluate the BERT_CS models on 8 GLUE datasets and compare the performance with the baseline BERT models. Following (?), we use the batch size 32 and fine-tune for 3 epochs for all GLUE tasks, and select the fine-tuning learning rate among 1e-5, 2e-5, and 3e-5 based on the performance on the development set. Results are presented in Table 6. We observe that the BERT_CSlarge model achieves comparable performance with the BERTlarge model and the BERT_CSbase model achieves slightly better performance than the BERTbase model. We hypothesize that the commonsense knowledge may not be required for GLUE tasks. On the other hand, these results demonstrate that our proposed pre-training approach does not degrade the sentence representation capabilities of BERT models.

Question, candidates, and the predictions of BERTlarge and BERT_CSlarge:
1) Dan had to stop Bill from toying with the injured bird. [He] is very compassionate.
   Candidates: A) Dan*   B) Bill      BERTlarge: B    BERT_CSlarge: A*
2) Dan had to stop Bill from toying with the injured bird. [He] is very cruel.
   Candidates: A) Dan    B) Bill*     BERTlarge: B*   BERT_CSlarge: B*
3) The trophy doesn’t fit into the brown suitcase because [it] is too large.
   Candidates: A) the trophy*   B) the suitcase     BERTlarge: B    BERT_CSlarge: B
4) The trophy doesn’t fit into the brown suitcase because [it] is too small.
   Candidates: A) the trophy    B) the suitcase*    BERTlarge: A    BERT_CSlarge: A
Table 8: Several cases from the Winograd Schema Challenge dataset. The pronouns in questions are in square brackets. The correct answers and correct model predictions are marked with an asterisk.

7 Analysis

7.1 Pre-training Strategy

We conduct several experiments comparing different data and different pre-training tasks on the BERTbase model. For simplicity, we discard the subscript base in this subsection.

The first set of experiments is to compare the efficacy of our data creation approach versus the data creation approach in (?). First, same as (?), we collect 606,564 triples from ConceptNet, and construct 1,213,128 questions, each with a correct answer and four distractors. This dataset is denoted the TRIPLES dataset. We pre-train BERT models on the TRIPLES dataset with the same hyper-parameters as the BERT_CS models and the resulting model is denoted BERT_triple. We also create several model counterparts based on our constructed dataset:

  • Distractors are formed by randomly picking concept1 or concept2 in ConceptNet instead of those sharing the same concept2 or concept1 and the same relation with the correct answers. We denote the resulting model from this dataset BERT_CS_random.

  • Instead of pre-training BERT with a multi-choice QA task that chooses the correct answer from several candidate answers, we mask concept1 and concept2 and pre-train BERT with a masked language model (MLM) task. We denote the resulting model from this pre-training task BERT_MLM.

  • We randomly mask 15% of the WordPiece tokens (?) of the question as in (?) and then conduct the multi-choice QA task and the MLM task simultaneously. The resulting model is denoted BERT_CS_MLM.

All these BERT models are fine-tuned on the CommonsenseQA dataset with the same hyper-parameters as described in Section 6.1 and the results are shown in Table 7. We observe the following from Table 7.

Comparing model 1 and model 2, we find that pre-training on ConceptNet benefits the CommonsenseQA task even with the triples as input instead of sentences. Further comparing model 2 and model 6, we find that constructing natural language sentences as input for pre-training BERT performs better on the CommonsenseQA task than pre-training using triples. We also conduct more detailed comparisons between fine-tuning model 1 and model 2 on GLUE tasks. The results are shown in Table 6. BERT_triplebase yields much worse results than BERTbase and BERT_CSbase, demonstrating that pre-training directly on triples may hurt the sentence representation capabilities of BERT.

Comparing model 3 and model 6, we find that pre-training BERT benefits from a more difficult dataset. In our selection method, all candidate answers share the same (concept1, relation) or (relation, concept2), so the candidates are close in meaning. These more confusing candidates force BERT_CS to distinguish between near-synonymous candidates, resulting in a more powerful BERT_CS model.

Comparing model 5 and model 6, we find that the multi-choice QA task works better than the masked LM task as the pre-training task for the target multi-choice QA task. We hypothesize that, for the masked LM task, BERT is required to predict each masked wordpiece (in concepts) independently; whereas, for the multi-choice QA task, BERT is required to model the whole candidate phrases. In the latter way, BERT is able to model the whole concepts instead of paying much attention to the single wordpieces in the sentences. Comparing model 4 and model 6, we observe that adding the masked LM task hurts the performance of BERT_CS. This is probably because the masked words in questions may have a negative influence on the multi-choice QA task. Finally, our proposed model BERT_CS achieves the best performance on the CommonsenseQA development set among these different models.

Figure 1: BERT_CSbase and BERT_CSlarge accuracy on the CommonsenseQA development set against the number of pre-training steps.

7.2 Performance Curve

We investigate how the performance of BERT_CS on the CommonsenseQA development set changes with the number of pre-training steps. Every 10,000 training steps, we save the model as an initial model for fine-tuning. For each of these models, we run 10 experiments with random restarts, that is, we use the same pre-trained checkpoint but shuffle the fine-tuning data differently. Due to the instability of fine-tuning BERT (?), we remove the results that are significantly lower than the mean. In our experiments, we remove accuracies lower than 0.57 for BERT_CSbase and 0.60 for BERT_CSlarge. The mean and standard deviation values are plotted in Figure 1. We observe that the performance of BERT_CSbase converges around 50,000 training steps, while BERT_CSlarge may not have converged at the end of the pre-training stage, which suggests that BERT_CSlarge has more capacity for incorporating commonsense knowledge. We also compare with pre-training BERT_CS models for 2 epochs. However, pre-training for 2 epochs produces worse performance, probably due to over-fitting. Pre-training on a larger corpus (with more QA samples) may benefit the BERT_CS models, and we leave this investigation to future work.
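The per-checkpoint aggregation can be sketched as below. The accuracy values are made-up placeholders rather than results from the paper, `summarize_runs` is a hypothetical helper, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was plotted.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize_runs(accuracies: List[float], floor: float) -> Tuple[float, float, int]:
    """Drop runs below the floor, then return (mean, standard deviation, runs kept)."""
    kept = [a for a in accuracies if a >= floor]
    if not kept:
        raise ValueError("all runs fell below the floor")
    spread = stdev(kept) if len(kept) > 1 else 0.0  # sample standard deviation (assumed)
    return mean(kept), spread, len(kept)


# Hypothetical fine-tuning accuracies for one checkpoint; 0.30 mimics a collapsed run.
runs = [0.60, 0.61, 0.59, 0.30, 0.60]
avg, spread, kept = summarize_runs(runs, floor=0.57)
print(f"mean={avg:.3f} std={spread:.3f} kept={kept}/{len(runs)}")
```

Applying the same function with a floor of 0.60 would correspond to the threshold stated for BERT_CSlarge.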

7.3 Error Analysis

Table 8 shows several cases from the WSC dataset. Questions 1 and 2 only differ in the words “compassionate” and “cruel”. BERT_CSlarge chooses the correct answers for both questions, while BERTlarge chooses “Bill” for both questions. We hypothesize that BERTlarge tends to choose the candidate closer to the pronoun. To investigate this hypothesis, we split the WSC test set into CLOSE and FAR subsets, based on whether the correct answer is closer to or farther from the pronoun in the sentence than the other candidate. We find that BERT_CSlarge achieves the same 82.4% accuracy on the CLOSE set as BERTlarge and significantly better accuracy on the FAR set than BERTlarge (68.6% versus 60.6%), suggesting that BERT_CSlarge is less influenced by proximity and more focused on semantics.
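One way to assign an example to the CLOSE or FAR subset is sketched below. The text does not say how distance was measured, so this sketch assumes character offsets between the first mention of each candidate and the bracketed pronoun; `proximity_subset` is a hypothetical name.

```python
def proximity_subset(sentence: str, pronoun: str, correct: str, other: str) -> str:
    """Return 'CLOSE' if the correct candidate is nearer to the pronoun than the other one."""
    text = sentence.lower()
    pronoun_pos = text.index(pronoun.lower())

    def distance(candidate: str) -> int:
        # Character offset between the first mention of the candidate and the pronoun.
        return abs(pronoun_pos - text.index(candidate.lower()))

    return "CLOSE" if distance(correct) < distance(other) else "FAR"


q1 = "Dan had to stop Bill from toying with the injured bird. [He] is very compassionate."
q2 = "Dan had to stop Bill from toying with the injured bird. [He] is very cruel."

print(proximity_subset(q1, "[He]", correct="Dan", other="Bill"))
print(proximity_subset(q2, "[He]", correct="Bill", other="Dan"))
```

Under this measure, question 1 of Table 8 falls in the FAR subset and question 2 in the CLOSE subset, which matches the pattern described above: a model biased toward the nearer candidate answers “Bill” for both.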

Questions 3 and 4 only differ in the words “large” and “small”. However, neither BERT_CSlarge nor BERTlarge chooses the correct answers. We hypothesize that since “suitcase is large” and “trophy is small” are probably quite frequent in the pre-training text, both BERTlarge and BERT_CSlarge models make mistakes. In future work, we will investigate other approaches for overcoming the sensitivity of language models and further improving commonsense reasoning.

8 Conclusion

In this paper, we develop a pre-training approach for incorporating commonsense knowledge into language representation models such as BERT. We construct a commonsense-related multi-choice question answering dataset for pre-training BERT. The dataset is created automatically by our proposed “align, mask, and select” (AMS) method. Experimental results demonstrate that pre-training using the proposed approach followed by fine-tuning significantly outperforms previous SOTA models on commonsense-related CommonsenseQA and Winograd Schema Challenge tasks, while maintaining comparable performance on other NLP tasks, such as sentence classification and NLI tasks, compared to the original BERT models. In future work, we will investigate the efficacy of the proposed approach in exploiting other structured knowledge graphs such as Freebase (?) for NLP tasks. We also plan to incorporate commonsense knowledge information into other language representation models such as the BERT model with whole word masking, XLNet (?), and RoBERTa (?).

References