Neural language representation models such as Bidirectional Encoder Representations from Transformers (BERT) pre-trained on large-scale corpora can well capture rich semantics from plain text, and can be fine-tuned to consistently improve the performance on various natural language processing (NLP) tasks. However, the existing pre-trained language representation models rarely consider explicitly incorporating commonsense knowledge or other knowledge. In this paper, we develop a pre-training approach for incorporating commonsense knowledge into language representation models. We construct a commonsense-related multi-choice question answering dataset for pre-training a neural language representation model. The dataset is created automatically by our proposed "align, mask, and select" (AMS) method. We also investigate different pre-training tasks. Experimental results demonstrate that pre-training models using the proposed approach followed by fine-tuning achieves significant improvements on various commonsense-related tasks, such as CommonsenseQA and Winograd Schema Challenge, while maintaining comparable performance on other NLP tasks, such as sentence classification and natural language inference (NLI) tasks, compared to the original BERT models.
Align, Mask and Select: A Simple Method for Incorporating
Commonsense Knowledge into Language Representation Models
Neural language representation models such as Bidirectional Encoder Representations from Transformers (BERT) can well capture rich language information from unlabelled text, and can be fine-tuned to benefit many NLP applications. However, the existing pre-trained language representation models rarely consider explicitly incorporating commonsense knowledge or other knowledge. We propose a pre-training approach for incorporating commonsense knowledge into language representation models. We construct a commonsense-related multi-choice question answering dataset for pre-training a neural language representation model. The dataset is created automatically by our proposed “align, mask, and select” (AMS) method. We also investigate different pre-training tasks. Experimental results demonstrate that pre-training models using the proposed approach followed by fine-tuning achieve significant improvements over previous state-of-the-art models on various commonsense-related tasks, such as CommonsenseQA and Winograd Schema Challenge. We also find that fine-tuned models after the proposed pre-training approach maintain comparable performance on other NLP tasks, such as sentence classification and natural language inference tasks, compared to the original BERT models. These results verify that the proposed approach, while significantly improving commonsense-related NLP tasks, does not degrade the language representation capabilities.
Zhi-Xiu Ye,1* Qian Chen,2 Wen Wang,2 Zhen-Hua Ling1
1 National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China
2 Speech Lab, DAMO Academy, Alibaba Group
* Work was done during an internship at DAMO Academy, Alibaba Group.
Neural language representation models, such as (?; ?; ?; ?; ?; ?), can well capture rich semantics from text and consistently improve the performance of various downstream natural language processing (NLP) tasks. Bidirectional Encoder Representations from Transformers (BERT) (?), as one of the most recently developed models, has produced the state-of-the-art (SOTA) results by simple fine-tuning on various NLP tasks, including natural language inference (NLI) (?), named entity recognition (NER) (?), question answering (QA) (?; ?), text classification (?), and has achieved human-level performances on several datasets (?; ?).
However, commonsense reasoning remains a challenging task for modern machine learning methods. For example, recently (?) proposed a commonsense-related task, CommonsenseQA, and showed that the BERT model accuracy remains dozens of points lower than the human accuracy on questions about commonsense knowledge. Some examples from CommonsenseQA are shown in Part A of Table 1. As can be seen from the examples, although it is easy for humans to answer the questions based on their knowledge about the world, it is a great challenge for machines when there is limited training data.
|A) Some examples from CommonsenseQA dataset|
|What can eating lunch cause that is painful?|
|headache, gain weight, farts, bad breath, heartburn|
|What is the main purpose of having a bath?|
|cleanness, water, exfoliation, hygiene, wetness|
|Where could you find a shark before it was caught?|
|business, marine museum, pool hall, tomales bay, desert|
|B) Some triples from ConceptNet|
|(eating dinner, Causes, heartburn)|
|(eating dinner, MotivatedByGoal, not get headache)|
|(lunch, Synonym, dinner)|
|(have bath, HasSubevent, cleaning)|
|(shark, AtLocation, tomales bay)|
We hypothesize that exploiting knowledge graphs (KGs) that represent commonsense knowledge in QA modeling may help the model choose correct answers. For example, as shown in Part B of Table 1, some triples from ConceptNet (?) are closely related to the questions above. Exploiting these triples may help QA models make the correct decision.
In this paper, we propose a pre-training approach that can leverage commonsense KGs, such as ConceptNet (?), to improve the commonsense reasoning capability of language representation models, such as BERT, without sacrificing the language representation capabilities of the models. That is, we also aim to maintain performance comparable to the original BERT models on other NLP tasks. Note that it is challenging to incorporate commonsense knowledge into language representation models, since commonsense knowledge is usually represented in a structured format, such as (concept, relation, concept) in ConceptNet, which is inconsistent with the data used for pre-training language representation models. For example, BERT is pre-trained on the BooksCorpus and English Wikipedia, which are composed of unstructured natural language sentences.
To tackle this challenge, inspired by the distant supervision approach (?), we propose an “align, mask and select” (AMS) method that aligns a commonsense KG with a large text corpus to construct a dataset consisting of sentences with labeled concepts. Different from the masked language model (MLM) and next sentence prediction (NSP) tasks used for pre-training BERT models, we construct a multi-choice question answering task based on the generated dataset. We then pre-train the BERT model on this dataset with the multi-choice question answering task and fine-tune it on commonsense-related tasks, such as CommonsenseQA (?) and Winograd Schema Challenge (WSC) (?). We observe significant improvements over previous SOTA results. We also fine-tune and evaluate the pre-trained models on other NLP tasks, such as sentence classification and NLI tasks in GLUE (?), and achieve comparable performance with the original BERT models.
In summary, our contributions are threefold. First, we propose a pre-training approach for incorporating commonsense knowledge into language representation models to improve the commonsense reasoning capabilities of these models. The proposed approach is applicable to BERT and other models that follow the fine-tuning paradigm, and to various KGs, including commonsense KGs and domain-specific KGs. Second, we propose an “align, mask and select” (AMS) method, inspired by the distant supervision approaches, to automatically construct a multi-choice question answering dataset and facilitate the proposed pre-training approach. Third, experiments demonstrate that the pre-trained models from the proposed approach with fine-tuning achieve significant improvement over previous SOTA models on several commonsense-related tasks, such as CommonsenseQA and Winograd Schema Challenge, and still maintain comparable performance on several sentence classification and NLI tasks on the GLUE dataset, demonstrating that the proposed pre-training approach does not degrade the language representation capabilities of the models.
2 Related Work
2.1 Language Representation Model
Language representation models have demonstrated their effectiveness for improving many NLP tasks. These approaches can be categorized into feature-based approaches and fine-tuning approaches. The early feature-based approaches include Word2Vec (?) and GloVe models (?), which transform words into distributed representations. However, these methods disambiguate word senses poorly. (?) further proposed Embeddings from Language Models (ELMo) that derive context-aware word vectors from a bidirectional LSTM, which is trained with a coupled language model (LM) objective on a large text corpus.
The above-mentioned feature-based approaches only use the pre-trained language representations as input features for other models. In contrast, the advantage of the fine-tuning approaches is that few parameters need to be learned from scratch, after pre-training. (?) pre-trained sentence encoders from unlabeled text and fine-tuned for a supervised downstream task. (?) developed a generative pre-trained Transformer (GPT) to learn a generative language model. (?) proposed bidirectional encoder representations from transformers (BERT), which achieved the SOTA performance on a wide variety of NLP tasks. Neither feature-based nor fine-tuning language representation models have incorporated the commonsense knowledge. In this paper, we focus on incorporating commonsense knowledge into the pre-training stage of the fine-tuning approaches.
2.2 Commonsense Reasoning
Commonsense reasoning is a challenging task for modern machine learning methods. As demonstrated in recent work (?), ensembling commonsense-knowledge-based models with question answering models improved the commonsense reasoning ability. An alternative direction is to directly incorporate commonsense knowledge into a unified language representation model. (?) proposed to directly pre-train BERT on commonsense knowledge triples. For any triple (concept, relation, concept), they took the concatenation of the first concept and the relation as the question and the second concept as the correct answer. Distractors were formed by randomly picking words or phrases in ConceptNet. In this work, we also investigate directly incorporating commonsense knowledge into a unified language representation model. However, we hypothesize that the language representations learned in (?) may be impaired, since the inputs to the model constructed this way are not natural language sentences. To address this issue, we propose a pre-training approach for incorporating commonsense knowledge that includes a method to construct large-scale natural language sentences. (?) collected the Common Sense Explanations (CoS-E) dataset using Amazon Mechanical Turk and applied a Commonsense Auto-Generated Explanations (CAGE) framework to language representation models, such as GPT and BERT. However, collecting this dataset required a large amount of human effort. In contrast, we propose an “align, mask and select” (AMS) method, inspired by the distant supervision approaches, to automatically construct a multi-choice question answering dataset.
2.3 Distant Supervision
The distant supervision approach was originally proposed for generating training data for the relation classification task. The distant supervision approach (?) assumes that if two entities/concepts participate in a relation, all sentences that mention these two entities/concepts express that relation. Note that noise is inevitable in data labeled by distant supervision (?). In this paper, instead of employing the relation labels produced by distant supervision, we focus on the aligned entities/concepts. We propose the AMS method to construct a multi-choice QA dataset: it aligns sentences with commonsense knowledge triples, masks the aligned words (entities/concepts) in sentences and treats the masked sentences as questions, and selects several entities/concepts from knowledge graphs as distractor choices.
3 Commonsense Knowledge Base
We use ConceptNet (?) (https://github.com/commonsense/conceptnet5/wiki), one of the most widely used commonsense knowledge bases. ConceptNet is a semantic network that represents a large set of words and phrases and the commonsense relationships between them (36 core relations). It contains over 21 million edges and over 8 million nodes. Its English vocabulary contains approximately 1,500,000 nodes, and it contains at least 10,000 nodes for each of 83 languages.
Each instance in ConceptNet can be represented as a triple (concept_1, relation, concept_2), indicating the relation between the two concepts concept_1 and concept_2. For example, the triple (semicarbazide, IsA, chemical compound) means that “semicarbazide is a kind of chemical compound”; the triple (cooking dinner, Causes, cooked food) means that “the effect of cooking dinner is cooked food”.
|(1) A triple from ConceptNet|
|(population, AtLocation, city)|
|(2) Align with the English Wikipedia dataset to obtain a sentence containing “population” and “city”|
|The largest city by population is Birmingham, which has long been the most industrialized city.|
|(3) Mask “city” with a special token “[QW]”|
|The largest [QW] by population is Birmingham, which has long been the most industrialized city?|
|(4) Select distractors by searching (population, AtLocation, *) in ConceptNet|
|(population, AtLocation, Michigan)|
|(population, AtLocation, Petrie dish)|
|(population, AtLocation, area with people inhabiting)|
|(population, AtLocation, country)|
|(5) Generate a multi-choice question answering sample|
|question: The largest [QW] by population is Birmingham, which has long been the most industrialized city?|
|candidates: city, Michigan, Petrie dish, area with people inhabiting, country|
4 Constructing Pre-training Dataset
We first filter the triples in ConceptNet as follows: (1) Remove triples in which one of the concepts is not an English word or phrase. (2) Remove triples with the general relations “RelatedTo” and “IsA”, which account for a large proportion of ConceptNet. (3) Remove triples in which one of the concepts has more than four words or the edit distance between the two concepts is less than four. After filtering, we obtain 606,564 triples.
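The three filtering rules above can be sketched as a single predicate. This is a minimal illustration, not the authors' code: the `Triple` type, the `is_english` check (a crude ASCII-letter test standing in for whatever language filter was actually used), and the choice of character-level Levenshtein distance are all assumptions made for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str       # concept_1
    relation: str
    tail: str       # concept_2


# Relations described in the text as too general to keep.
GENERAL_RELATIONS = {"RelatedTo", "IsA"}


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def is_english(concept: str) -> bool:
    """Hypothetical stand-in for a language check: ASCII letters, spaces, hyphens, apostrophes."""
    return bool(concept) and all(
        ch.isascii() and (ch.isalpha() or ch in " -'") for ch in concept
    )


def keep_triple(t: Triple) -> bool:
    """Return True if the triple survives rules (1)-(3)."""
    if not (is_english(t.head) and is_english(t.tail)):
        return False                                    # rule (1)
    if t.relation in GENERAL_RELATIONS:
        return False                                    # rule (2)
    if len(t.head.split()) > 4 or len(t.tail.split()) > 4:
        return False                                    # rule (3), length
    if edit_distance(t.head, t.tail) < 4:
        return False                                    # rule (3), near-duplicates
    return True
```

The edit-distance rule presumably removes pairs such as inflectional variants, whose answer would be trivially recoverable from the unmasked concept.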
Each training sample is generated in three steps: align, mask and select, which together we call the AMS method. Each sample in the dataset consists of a question and five candidate answers, which is the same form as the CommonsenseQA dataset. An example of constructing one training sample by masking concept_2 is shown in Table 2.
First, we align each triple (concept_1, relation, concept_2) in the filtered triple set to the English Wikipedia dataset to extract the sentences containing the two concepts. Second, we mask concept_1 or concept_2 in one sentence with a special token [QW] and treat this sentence as a question, where QW is a replacement word for the question words “what”, “where”, etc. The masked concept_1 or concept_2 is the correct answer for this question. Third, we generate the distractors. (?) formed distractors by randomly picking words or phrases in ConceptNet. In our work, in order to generate more confusing distractors than the random selection approach, we select distractors that share the other, unmasked concept (i.e., concept_2 or concept_1) and the same relation with the correct answer. That is, we search (*, relation, concept_2) or (concept_1, relation, *) in ConceptNet to select the distractors, where * is a wildcard that can match any word or phrase. For each question, we keep four distractors and one correct answer. If there are fewer than four matched distractors, we discard the question instead of complementing it with random selection. If there are more than four distractors, we randomly select four of them. After applying the AMS method, we create 16,324,846 multi-choice question answering samples.
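The align, mask and select steps can be sketched for a single (triple, sentence) pair as follows. This is an illustrative sketch under stated assumptions rather than the authors' pipeline: alignment is approximated by case-insensitive whole-word matching, only the first occurrence of the answer is masked (as in the Table 2 example), the trailing period is turned into a question mark, and the function and index names are hypothetical.

```python
import random
import re
from collections import defaultdict


def build_indexes(triples):
    """Index triples so both wildcard queries are dictionary lookups."""
    by_head_rel = defaultdict(set)   # (concept_1, relation) -> {concept_2, ...}
    by_rel_tail = defaultdict(set)   # (relation, concept_2) -> {concept_1, ...}
    for head, rel, tail in triples:
        by_head_rel[(head, rel)].add(tail)
        by_rel_tail[(rel, tail)].add(head)
    return by_head_rel, by_rel_tail


def contains(sentence, concept):
    """Align: whole-word, case-insensitive match (an assumed matching rule)."""
    pattern = r"\b" + re.escape(concept) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def mask(sentence, concept):
    """Mask: replace the first occurrence with [QW] and end with '?'."""
    pattern = r"\b" + re.escape(concept) + r"\b"
    masked = re.sub(pattern, "[QW]", sentence, count=1, flags=re.IGNORECASE)
    return masked.rstrip(".") + "?"


def ams_sample(triple, sentence, by_head_rel, by_rel_tail,
               mask_tail=True, rng=random, n_distractors=4):
    """Build one multi-choice sample, or return None if it must be discarded."""
    head, rel, tail = triple
    if not (contains(sentence, head) and contains(sentence, tail)):
        return None                                  # sentence does not align
    answer = tail if mask_tail else head
    # Select: concepts sharing the unmasked concept and the relation.
    pool = (by_head_rel.get((head, rel), set()) if mask_tail
            else by_rel_tail.get((rel, tail), set()))
    distractors = sorted(pool - {answer})
    if len(distractors) < n_distractors:
        return None                                  # too few: discard
    candidates = rng.sample(distractors, n_distractors) + [answer]
    rng.shuffle(candidates)
    return {"question": mask(sentence, answer),
            "candidates": candidates,
            "answer": answer}
```

Applied to the triple and sentence from Table 2, with the four listed distractor triples in the index, this yields the question and the five candidates shown there (in a random order).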
5 Pre-training BERT_CS
We investigate a multi-choice QA task for pre-training the English BERT base and BERT large models released by Google (https://github.com/google-research/bert) on our constructed dataset. The resulting models are denoted BERT_CS_base and BERT_CS_large, respectively. We then evaluate the performance of fine-tuning the BERT_CS models on several NLP tasks (Section 6).
To reduce the large cost of training BERT_CS models from scratch, we initialize the BERT_CS models (for both BERT_base and BERT_large models) with the parameter weights released by Google. We concatenate the question with each candidate to construct a standard input sequence for BERT_CS (i.e., “[CLS] the largest [QW] by … ? [SEP] city [SEP]”, where [CLS] and [SEP] are two special tokens), and the hidden representations over the [CLS] token are run through a softmax layer to predict whether the candidate is the correct answer.
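The objective function is defined as follows:

$$\mathcal{L} = -\log p(c_i \mid s), \tag{1}$$

$$p(c_i \mid s) = \frac{\exp(\mathbf{w}^{\top} \mathbf{c}_i)}{\sum_{k=1}^{N} \exp(\mathbf{w}^{\top} \mathbf{c}_k)}, \tag{2}$$

where $c_i$ is the correct answer for question $s$, $\mathbf{w}$ the parameters in the softmax layer, $N$ the total number of candidates, and $\mathbf{c}_k$ the vector representation of the special token [CLS] for the $k$-th candidate. We pre-train BERT_CS models with the batch size 160, the initial learning rate , and the max sequence length 128 for 1 epoch. Pre-training is conducted on 16 NVIDIA V100 GPU cards with 32G memory for about 3 days for the BERT_CS_large model and 1 day for the BERT_CS_base model.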
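To make the input construction and the objective concrete, here is a small sketch in plain Python. It assumes the objective is the usual softmax cross-entropy over one score per candidate, where each score is the dot product of a weight vector `w` with that candidate's [CLS] vector; the function names are hypothetical, and the [CLS] vectors are passed in as plain lists rather than produced by a real BERT encoder.

```python
import math


def build_inputs(question, candidates):
    """One '[CLS] question [SEP] candidate [SEP]' sequence per candidate."""
    return [f"[CLS] {question} [SEP] {cand} [SEP]" for cand in candidates]


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def candidate_score(w, cls_vec):
    """Dot product between the softmax-layer weights and a [CLS] vector."""
    return sum(wi * ci for wi, ci in zip(w, cls_vec))


def mcqa_loss(w, cls_vectors, correct_index):
    """Negative log-probability of the correct candidate among N candidates."""
    scores = [candidate_score(w, c) for c in cls_vectors]
    return -log_softmax(scores)[correct_index]
```

With five candidates whose scores are all equal, the loss is log 5 (about 1.61), which is the value an untrained scorer would start from on this task.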
|BERT + MTP||70.3||70.8||67.6||73.3||70.1||59.5||70.5|
|BERT + MCQA||71.4||69.9||81.1||71.8||64.9||82.4||78.5|
|BERT_CS + MCQA||75.5||73.7||86.5||74.8||73.3||86.3||83.6|
In the first set of experiments, we fine-tune the BERT_CS models on several commonsense-related NLP tasks to investigate whether the proposed approach improves the commonsense reasoning capability. In the second set of experiments, we evaluate the BERT_CS models on other NLP tasks, such as text classification and natural language inference (NLI), to investigate whether the proposed approach maintains the general language representation capabilities.
Note that when fine-tuning on the commonsense-related multi-choice QA tasks, e.g., CommonsenseQA and Winograd Schema Challenge, we fine-tune all parameters in BERT_CS, including the last softmax layer over the [CLS] token; whereas for the second set of experiments, we randomly initialize the classifier layer and train it from scratch. Additionally, as described in (?), fine-tuning BERT is sometimes observed to be unstable on small datasets. Hence we run experiments with 5 different random seeds and select the best model based on the development set for all of the fine-tuning experiments in this section.
6.1 CommonsenseQA
The CommonsenseQA dataset consists of 12,247 questions with one correct answer and four distractor answers. This dataset consists of two splits, the question token split and the random split. Our experiments are conducted on the more challenging random split, which is the main evaluation split according to (?). The statistics of the CommonsenseQA dataset are summarized in Table 3.
As in the pre-training stage, the input data for fine-tuning the BERT_CS models is formed by concatenating each question-answer pair as a sequence. The hidden representations over the [CLS] token are run through a softmax layer to create the predictions. The objective function is the same as Equations 1 and 2. We fine-tune the BERT_CS models on CommonsenseQA for 2 epochs with a learning rate of 1e-5 and a batch size of 16.
Table 4 shows the accuracy on the CommonsenseQA test set from the baseline BERT models, the previous SOTA model CoS-E (?), and our BERT_CS models. The CoS-E model requires a large amount of human effort to collect the Common Sense Explanations (CoS-E) dataset. In comparison, our multi-choice question answering dataset is constructed automatically. The BERT_CS models significantly outperform the baseline BERT models, with BERT_CS achieving a 5.5% absolute gain over the baseline BERT model and a 4.0% absolute gain over the previous SOTA CoS-E model.
6.2 Winograd Schema Challenge
The Winograd Schema Challenge (WSC) (?) was introduced to test AI agents for commonsense knowledge. The WSC consists of 273 instances of the pronoun disambiguation problem. For example, for the sentence “The delivery truck zoomed by the school bus because it was going so fast.” and a corresponding question “What does the word it refer to?”, the machine is expected to answer “delivery truck” instead of “school bus”. On this task, we follow (?) and employ the WSCR dataset (?) as the extra training data. The WSCR dataset is split into a training set of 1,322 examples and a test set of 564 examples. We use the training set for fine-tuning and the test set for validating BERT_CS models, respectively, and test the fine-tuned BERT_CS models on the WSC dataset.
We transform the pronoun disambiguation problem into a multi-choice question answering problem. We mask the pronoun with a special token [QW] to construct a question, and use the two candidate phrases as candidate answers. The remaining procedures are the same as for the CommonsenseQA task. We use the same loss function as (?), that is, if $c_1$ is correct and $c_2$ is not, the loss is

$$\mathcal{L} = -\log p(c_1 \mid s) + \alpha \cdot \max\bigl(0,\ \log p(c_2 \mid s) - \log p(c_1 \mid s) + \beta\bigr),$$

where $p(c_1 \mid s)$ follows Equation 2 with $N = 2$, and $\alpha$ and $\beta$ are hyper-parameters. Similar to (?), we search $\alpha$ and $\beta$ by optimizing the accuracy on the WSCR test set (i.e., the development set for the WSC dataset). We set the batch size 16 and the learning rate . We evaluate our models on the WSC dataset and its various partitions described in (?). We also evaluate the fine-tuned BERT_CS model (without using the WNLI training data for further fine-tuning) on the WNLI test set, one of the GLUE tasks (?). We first transform the examples in WNLI from the premise-hypothesis format into the pronoun disambiguation problem format (?) and then transform it into the multi-choice QA format.
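The two-candidate loss can be sketched as below. The sketch assumes the loss takes the margin form L = -log p(c_1|s) + alpha * max(0, log p(c_2|s) - log p(c_1|s) + beta), with the two probabilities coming from a softmax over two candidate scores; the function names and the alpha and beta values used in the usage note are illustrative, not the values selected in the paper.

```python
import math


def two_way_log_probs(score_correct, score_wrong):
    """Log-softmax over exactly two candidate scores (Equation 2 with N = 2)."""
    m = max(score_correct, score_wrong)
    log_z = m + math.log(math.exp(score_correct - m) + math.exp(score_wrong - m))
    return score_correct - log_z, score_wrong - log_z


def wsc_loss(logp_correct, logp_wrong, alpha, beta):
    """Cross-entropy on the correct candidate plus a weighted margin penalty.

    The penalty is active only while the wrong candidate's log-probability
    is within `beta` of (or above) the correct candidate's.
    """
    margin_term = max(0.0, logp_wrong - logp_correct + beta)
    return -logp_correct + alpha * margin_term
```

For example, with equal scores both log-probabilities are -log 2, so with the illustrative values alpha = 5 and beta = 0.2 the loss is log 2 + 1.0; once the correct candidate leads by more than beta in log-probability, only the cross-entropy term remains.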
|3||BERT_CS_random||Wikipedia and ConceptNet||MCQA||59.4|
|4||BERT_CS_MLM||Wikipedia and ConceptNet||MCQA+MLM||59.9|
|5||BERT_MLM||Wikipedia and ConceptNet||MLM||58.8|
|6||BERT_CS||Wikipedia and ConceptNet||MCQA||60.8|
The results on the WSC dataset and its various partitions and the WNLI test set are shown in Table 5. Note that the results for (?) are fine-tuned on the whole WSCR dataset, including its training and test set partitions. Results for LM ensemble (?) and Knowledge Hunter (?) are cited from (?). Results for “BERT + MTP” are taken from (?) as the baseline of applying BERT to the WSC task. As can be seen from Table 5, “BERT + MCQA” achieves better performance than “BERT + MTP” on four of the seven evaluation criteria and achieves significant improvements on the assoc. and consist. partitions, which demonstrates that MCQA is a better problem formatting method than MTP for the WSC task. Also, our “BERT_CS + MCQA” achieves the best performance on all of the evaluation criteria except consist. (Knowledge Hunter achieves better performance on consist. by a rule-based method (?)), and achieves a 3.3% absolute improvement on the WSC dataset over the previous SOTA results from (?).
The General Language Understanding Evaluation (GLUE) benchmark (?) is a collection of diverse natural language understanding tasks, including single-sentence tasks CoLA and SST-2, similarity and paraphrasing tasks MRPC, STS-B and QQP, and natural language inference tasks MNLI, QNLI, RTE and WNLI.
To investigate whether our multi-choice QA based pre-training approach can maintain the performance on common text classification tasks, we evaluate the BERT_CS models on 8 GLUE datasets and compare the performance with the baseline BERT models. Following (?), we use the batch size 32 and fine-tune for 3 epochs for all GLUE tasks, and select the fine-tuning learning rate among 1e-5, 2e-5, and 3e-5 based on the performance on the development set. Results are presented in Table 6. We observe that the BERT_CS model achieves comparable performance with the BERT model and the BERT_CS model achieves slightly better performance than the BERT model. We hypothesize that the commonsense knowledge may not be required for GLUE tasks. On the other hand, these results demonstrate that our proposed pre-training approach does not degrade the sentence representation capabilities of BERT models.
|1) Dan had to stop Bill from toying with the injured bird. [He] is very compassionate.|
|2) Dan had to stop Bill from toying with the injured bird. [He] is very cruel.|
|3) The trophy doesn’t fit into the brown suitcase because [it] is too large.|
|4) The trophy doesn’t fit into the brown suitcase because [it] is too small.|
7.1 Pre-training Strategy
We conduct several experiments comparing different data and different pre-training tasks on the BERT model. For simplicity, we discard the subscript in this subsection.
The first set of experiments is to compare the efficacy of our data creation approach versus the data creation approach in (?). First, same as (?), we collect 606,564 triples from ConceptNet, and construct 1,213,128 questions, each with a correct answer and four distractors. This dataset is denoted the TRIPLES dataset. We pre-train BERT models on the TRIPLES dataset with the same hyper-parameters as the BERT_CS models and the resulting model is denoted BERT_triple. We also create several model counterparts based on our constructed dataset:
Distractors are formed by randomly picking concept_1 or concept_2 in ConceptNet instead of those sharing the same concept_2 or concept_1 and the same relation with the correct answers. We denote the resulting model from this dataset BERT_CS_random.
Instead of pre-training BERT with a multi-choice QA task that chooses the correct answer from several candidate answers, we mask concept_1 and concept_2 and pre-train BERT with a masked language model (MLM) task. We denote the resulting model from this pre-training task BERT_MLM.
We randomly mask 15% of the WordPiece tokens (?) of the question as in (?) and then conduct both the multi-choice QA task and the MLM task simultaneously. The resulting model is denoted BERT_CS_MLM.
All these BERT models are fine-tuned on the CommonsenseQA dataset with the same hyper-parameters as described in Section 6.1 and the results are shown in Table 7. We observe the following from Table 7.
Comparing model 1 and model 2, we find that pre-training on ConceptNet benefits the CommonsenseQA task even with the triples as input instead of sentences. Further comparing model 2 and model 6, we find that constructing natural language sentences as input for pre-training BERT performs better on the CommonsenseQA task than pre-training using triples. We also conduct more detailed comparisons between fine-tuning model 1 and model 2 on GLUE tasks. The results are shown in Table 6. BERT_triple yields much worse results than BERT and BERT_CS, demonstrating that pre-training directly on triples may hurt the sentence representation capabilities of BERT.
Comparing model 3 and model 6, we find that pre-training BERT benefits from a more difficult dataset. In our selection method, all candidate answers share the same (concept_1, relation) or (relation, concept_2), that is, these candidates have close meanings. These more confusing candidates force BERT_CS to distinguish between closely related meanings, resulting in a more powerful BERT_CS model.
Comparing model 5 and model 6, we find that the multi-choice QA task works better than the masked LM task as the pre-training task for the target multi-choice QA task. We hypothesize that, for the masked LM task, BERT is required to predict each masked wordpiece (in concepts) independently; whereas, for the multi-choice QA task, BERT is required to model the whole candidate phrases. In the latter way, BERT is able to model the whole concepts instead of paying much attention to the single wordpieces in the sentences. Comparing model 4 and model 6, we observe that adding the masked LM task hurts the performance of BERT_CS. This is probably because the masked words in questions may have a negative influence on the multi-choice QA task. Finally, our proposed model BERT_CS achieves the best performance on the CommonsenseQA development set among these different models.
7.2 Performance Curve
We investigate the performance curve on the CommonsenseQA development set from BERT_CS against the pre-training steps. Every 10,000 training steps, we save the model as the initial model for fine-tuning. For each of these models, we run experiments 10 times with random restarts, that is, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling. Due to the instability of fine-tuning BERT (?), we remove the results that are significantly lower than the mean. In our experiments, we remove accuracies lower than 0.57 for BERT_CS_base and 0.60 for BERT_CS_large. The mean and standard deviation values are plotted in Figure 1. We observe that the performance of BERT_CS_base converges around 50,000 training steps and BERT_CS_large may not have converged at the end of the pre-training stage, which demonstrates that BERT_CS_large is more powerful at incorporating commonsense knowledge. We also compare with pre-training BERT_CS models for 2 epochs. However, pre-training for 2 epochs produces worse performance, probably due to over-fitting. Pre-training on a larger corpus (with more QA samples) may benefit the BERT_CS models and we leave this investigation to future work.
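The per-checkpoint aggregation described above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the use of the sample standard deviation is an assumption (the text does not say which estimator was plotted), and the step count simply divides the 16,324,846 samples by the batch size of 160 without knowing how a final partial batch was handled.

```python
from statistics import mean, stdev


def steps_per_epoch(n_samples, batch_size):
    """Number of optimizer steps in one pass, counting a final partial batch."""
    return -(-n_samples // batch_size)   # ceiling division on integers


def summarize_runs(accuracies, floor):
    """Drop runs below `floor`, then return (mean, std, number of runs kept).

    Returns None if every run was discarded.
    """
    kept = [a for a in accuracies if a >= floor]
    if not kept:
        return None
    spread = stdev(kept) if len(kept) > 1 else 0.0
    return mean(kept), spread, len(kept)
```

Under these assumptions one epoch is about 102,000 steps, so saving every 10,000 steps gives roughly ten checkpoints, each summarized from up to 10 fine-tuning restarts.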
7.3 Error Analysis
Table 8 shows several cases from the WSC dataset. Questions 1 and 2 only differ in the words “compassionate” and “cruel”. BERT_CS chooses correct answers for both questions while BERT chooses the same choice “Bill” for both questions. We hypothesize that BERT tends to choose the closer candidates. To investigate this hypothesis, we split the WSC test set into CLOSE and FAR subsets, based on whether the correct answer is closer to or farther from the pronoun in the sentence than the other candidate. We find that BERT_CS achieves the same 82.4% accuracy on the CLOSE set as BERT and significantly better accuracy on the FAR set than BERT (68.6% versus 60.6%), demonstrating that BERT_CS is probably less influenced by proximity and more focused on semantics.
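One way to assign an example to the CLOSE or FAR subset is sketched below. The distance measure is an assumption: it counts characters between the pronoun and the nearest preceding mention of each candidate, using simple case-insensitive substring search, whereas the authors may have measured distance in tokens. The function name and the bracketed pronoun marker (taken from the Table 8 notation) are illustrative.

```python
def proximity_subset(sentence, pronoun, correct, other):
    """Return 'CLOSE' if the correct candidate is nearer to the pronoun, else 'FAR'.

    `pronoun` is the marked pronoun as it appears in the sentence, e.g. '[He]'.
    Ties are assigned to 'FAR'.
    """
    text = sentence.lower()
    p = text.find(pronoun.lower())
    if p < 0:
        raise ValueError("pronoun marker not found in sentence")

    def gap(candidate):
        c = candidate.lower()
        start = text.rfind(c, 0, p)          # last mention before the pronoun
        if start >= 0:
            return p - (start + len(c))
        start = text.find(c, p)              # otherwise first mention after it
        if start < 0:
            raise ValueError(f"candidate {candidate!r} not found in sentence")
        return start - (p + len(pronoun))

    return "CLOSE" if gap(correct) < gap(other) else "FAR"
```

On Question 1 of Table 8 the correct answer “Dan” is farther from the pronoun than “Bill”, so the example falls in FAR; Question 2, whose correct answer is “Bill”, falls in CLOSE.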
Questions 3 and 4 only differ in the words “large” and “small”. However, neither BERT_CS nor BERT chooses the correct answers. We hypothesize that since “suitcase is large” and “trophy is small” are probably quite frequent in the pre-training text, both BERT and BERT_CS models make mistakes. In future work, we will investigate other approaches for overcoming the sensitivity of language models and further improving commonsense reasoning.
In this paper, we develop a pre-training approach for incorporating commonsense knowledge into language representation models such as BERT. We construct a commonsense-related multi-choice question answering dataset for pre-training BERT. The dataset is created automatically by our proposed “align, mask, and select” (AMS) method. Experimental results demonstrate that pre-training using the proposed approach followed by fine-tuning significantly outperforms previous SOTA models on commonsense-related CommonsenseQA and Winograd Schema Challenge tasks, while maintaining comparable performance on other NLP tasks, such as sentence classification and NLI tasks, compared to the original BERT models. In future work, we will investigate the efficacy of the proposed approach in exploiting other structured knowledge graphs such as Freebase (?) for NLP tasks. We also plan to incorporate commonsense knowledge information into other language representation models such as the BERT model with whole word masking, XLNet (?), and RoBERTa (?).