Deepening Hidden Representations from Pre-trained Language Models for Natural Language Understanding

  • 2019-11-05 16:59:50
  • Junjie Yang, Hai Zhao

Abstract

Transformer-based pre-trained language models have proven to be effective for learning contextualized language representation. However, current approaches only take advantage of the output of the encoder's final layer when fine-tuning on downstream tasks. We argue that taking only a single layer's output restricts the power of the pre-trained representation. Thus we deepen the representation learned by the model by fusing the hidden representation by means of an explicit HIdden Representation Extractor (HIRE), which automatically absorbs the complementary representation with respect to the output from the final layer. Utilizing RoBERTa as the backbone encoder, our proposed improvement over the pre-trained models is shown to be effective on multiple natural language understanding tasks and helps our model rival the state-of-the-art models on the GLUE benchmark.

 



1 Introduction

Language representation is essential to the understanding of text. Recently, pre-trained language models based on the Transformer (Vaswani et al., 2017), such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019b), have been shown to be effective for learning contextualized language representation. These models have since continued to achieve new state-of-the-art results on a variety of natural language processing tasks, including question answering (Rajpurkar et al., 2018; Lai et al., 2017), natural language inference (Williams et al., 2018; Bowman et al., 2015), named entity recognition (Tjong Kim Sang and De Meulder, 2003), sentiment analysis (Socher et al., 2013) and semantic textual similarity (Cer et al., 2017; Dolan and Brockett, 2005).

Normally, Transformer-based models are pre-trained on large-scale unlabeled corpora in an unsupervised manner and then fine-tuned on downstream tasks by introducing a task-specific output layer. When fine-tuning on supervised downstream tasks, the models pass the output of the Transformer encoder’s final layer, which is considered the contextualized representation of the input text, directly to the task-specific layer.

However, due to the numerous layers (i.e., Transformer blocks) and considerable depth of these pre-trained models, we argue that the output of the last layer may not always be the best representation of the input text when fine-tuning on a downstream task. Devlin et al. (2019) show that different combinations of layer outputs from the pre-trained BERT lead to distinct performance on the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003). Peters et al. (2018) point out that for pre-trained language models, including the Transformer, the most transferable contextualized representations of input text tend to occur in the middle layers, while the top layers specialize in language modeling. Therefore, relying solely on the last layer’s output may restrict the power of the pre-trained representation.

In this paper, we introduce RTRHI: Refined Transformer Representation with Hidden Information, a fine-tuning approach for Transformer-based models that leverages the hidden information in the Transformer’s hidden layers to refine the language representation. Our approach consists of two main additional components:

  1. HIdden Representation Extractor (HIRE) dynamically learns a complementary representation which contains the information that the final layer’s output fails to capture. We put a 2-layer bidirectional GRU beside the encoder to summarize the output of each layer into a single vector, which is used to compute the contribution score.

  2. Fusion layer integrates the hidden information extracted by the HIRE with the Transformer final layer’s output through two steps of different functionalities, leading to a refined contextualized language representation.

Taking advantage of the robustness of RoBERTa by using it as the Transformer-based encoder of RTRHI, we conduct experiments on the GLUE benchmark (Wang et al., 2018), which consists of nine Natural Language Understanding (NLU) tasks. RTRHI outperforms our baseline model RoBERTa on 5/9 of them and reaches state-of-the-art results on the SST-2 dataset. Even though we don’t make any modification to the encoder’s internal architecture or redefine the pre-training procedure with different objectives or datasets, we still achieve performance comparable to other state-of-the-art models on the GLUE leaderboard. These results highlight RTRHI’s ability to refine the language representation of Transformer-based models.

2 Related Work

Transformer-based language models take the Transformer (Vaswani et al., 2017) as their model architecture but are pre-trained with different objectives or corpora. OpenAI GPT (Radford et al., 2018) is the first model to introduce the Transformer architecture into unsupervised pre-training. It is a 12-layer left-to-right Transformer pre-trained on the BooksCorpus (Zhu et al., 2015) dataset. Instead of using a left-to-right architecture like GPT, BERT (Devlin et al., 2019) adopts a Masked LM objective during pre-training, which enables the representation to incorporate context from both directions. BERT also uses the next sentence prediction (NSP) objective to better capture the relationship between two sentences. Its training is conducted on a combination of BooksCorpus and English Wikipedia. XLNet (Yang et al., 2019), a generalized autoregressive language model, instead uses a permutation language modeling objective during pre-training. In addition to BooksCorpus and English Wikipedia, it also uses Giga5, ClueWeb 2012-B and Common Crawl for pre-training. Trained with dynamic masking, large mini-batches, a larger byte-level BPE vocabulary, and full sentences without NSP, RoBERTa (Liu et al., 2019b) improves BERT’s performance on downstream tasks. Its pre-training corpora include BooksCorpus, CC-News, OpenWebText and Stories. By fine-tuning on downstream tasks in a supervised manner, these powerful Transformer-based models all push the state-of-the-art results on various NLP tasks to a new level.

Recent works have proposed new methods for fine-tuning the downstream tasks, including multi-task learning (Liu et al., 2019a), adversarial training (Zhu et al., 2019) or incorporating semantic information into language representation (Zhang et al., 2019b).

3 Model and Method

Figure 1: Architecture of RTRHI.

3.1 Transformer-based encoder layer

The Transformer-based encoder layer is responsible for encoding the input text into a sequence of high-dimensional vectors, which is considered the contextualized representation of the input sequence. Let $\{w_1, \dots, w_n\}$ represent a sequence of $n$ words of input text. We use a Transformer-based encoder to encode the input sequence and thereby obtain its universal contextualized representation $R \in \mathbb{R}^{n \times d}$:

$$R = \mathrm{TransformerBasedEncoder}(\{w_1, \dots, w_n\}) \tag{1}$$

where $d$ is the hidden size of the encoder. It should be noted that $R$ is the output of the Transformer-based encoder’s last layer, which has the same length as the input text. We call it the preliminary representation in this paper to distinguish it from the one that we introduce in Section 3.2. Here, we omit the rather extensive formulation of the Transformer and refer readers to Radford et al. (2018), Devlin et al. (2019) and Yang et al. (2019) for more details.

3.2 Hidden Representation Extractor

Since a Transformer-based encoder normally has many identical layers stacked together (for example, $\text{BERT}_{\text{LARGE}}$ and $\text{XLNet}_{\text{LARGE}}$ both contain 24 layers of identical structure), the output of the final layer may not be the best candidate to fully represent the information contained in the input text.

To address this problem, we introduce a HIdden Representation Extractor (HIRE) beside the encoder to draw from the hidden states the information that the output of the last layer fails to capture. Since the hidden states of different layers are not equally important for representing a given input sequence, we adopt a mechanism that computes the importance dynamically. We call this importance the contribution score.

The input to the HIRE is $\{H_0, \dots, H_j, \dots, H_l\}$, where $0 < j \le l$ and $l$ represents the number of layers in the encoder. Here $H_0$ is the initial embedding of the input text, which is the input of the encoder’s first layer but is updated during training, and $H_j \in \mathbb{R}^{n \times d}$ is the hidden-state of the encoder at the output of layer $j$. For the sake of simplicity, we call them all hidden-states afterwards.

For each hidden-state of the encoder, we use the same 2-layer Bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014) to summarize it. Instead of taking the whole output of the GRU as the representation of the hidden-state, we concatenate the final states of each layer and each direction of the GRU. In this way, we summarize the hidden-state into a fixed-size vector. Hence, we obtain $U \in \mathbb{R}^{(l+1) \times 4d}$, with $U_i$ the summarized vector of $H_i$:

$$U_i = \text{Bi-GRU}(H_i) \in \mathbb{R}^{4d} \tag{2}$$

where $0 \le i \le l$. Then the importance value $\alpha_i$ for hidden-state $H_i$ is calculated by:

$$\alpha_i = W^T U_i + b \tag{3}$$

where $W \in \mathbb{R}^{4d}$ and $b \in \mathbb{R}$ are trainable parameters. Let $S$ represent the contribution scores for all hidden-states. $S$ is computed as follows:

$$S = \mathrm{softmax}(\alpha) \in \mathbb{R}^{l+1} \tag{4}$$

It should be noted that $\sum_{i=0}^{l} S_i = 1$, where $S_i$ is the weight of hidden-state $i$ when computing the representation. Subsequently, we obtain the input sequence’s new representation $A$ by:

$$A = \sum_{i=0}^{l} S_i H_i \in \mathbb{R}^{n \times d} \tag{5}$$

With the same shape as the output of the Transformer-based encoder’s final layer, HIRE’s output $A$ is expected to contain additional useful information from the encoder’s hidden-states that helps to better understand the input text. We call it the complementary representation.
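Equations (3) to (5) can be traced with a small pure-Python sketch. It assumes the summary vectors $U_i$ have already been produced by the shared Bi-GRU. The function names and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def contribution_scores(U, W, b):
    """Eq. (3)-(4): alpha_i = W^T U_i + b, then S = softmax(alpha)."""
    alpha = [sum(w * u for w, u in zip(W, U_i)) + b for U_i in U]
    return softmax(alpha)

def complementary_representation(H, S):
    """Eq. (5): weighted sum of the hidden-states, each an n x d matrix."""
    n, d = len(H[0]), len(H[0][0])
    A = [[0.0] * d for _ in range(n)]
    for S_i, H_i in zip(S, H):
        for t in range(n):
            for k in range(d):
                A[t][k] += S_i * H_i[t][k]
    return A

# Toy setting (hypothetical values): l = 2, so 3 hidden-states; n = 2 tokens; d = 1.
H = [[[1.0], [2.0]], [[3.0], [4.0]], [[5.0], [6.0]]]
# Summary vectors of size 4d = 4, assumed to come from the shared Bi-GRU.
U = [[0.1, 0.0, 0.0, 0.0], [0.2, 0.0, 0.0, 0.0], [0.3, 0.0, 0.0, 0.0]]
W, b = [1.0, 0.0, 0.0, 0.0], 0.0

S = contribution_scores(U, W, b)
A = complementary_representation(H, S)
```

Because $S$ sums to 1, every entry of $A$ is a convex combination of the corresponding entries of the hidden-states, so $A$ keeps the same shape and scale as a single layer's output.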

Figure 2: Architecture of the Hidden Representation Extractor. The GRUs share the same parameters.
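One way to read Equation (2) is to spell out the concatenation. The sketch below assumes each GRU direction has hidden size $d$, which is what the stated $4d$ output size implies for two layers and two directions. The symbols $\overrightarrow{h}^{(k)}_i$ and $\overleftarrow{h}^{(k)}_i$ (final forward and backward states of GRU layer $k$ on input $H_i$) and the concatenation order are introduced here for illustration only.

```latex
U_i = \left[\, \overrightarrow{h}^{(1)}_i \,;\, \overleftarrow{h}^{(1)}_i \,;\,
               \overrightarrow{h}^{(2)}_i \,;\, \overleftarrow{h}^{(2)}_i \,\right]
      \in \mathbb{R}^{4d},
\qquad
\overrightarrow{h}^{(k)}_i,\ \overleftarrow{h}^{(k)}_i \in \mathbb{R}^{d}
```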

3.3 Fusion Layer

This layer fuses the information contained in the output of the Transformer-based encoder with the information extracted from the encoder’s hidden states by HIRE.

Given the preliminary representation $R$, instead of letting it flow directly into the task-specific output layer, we combine it with the complementary representation $A$ to yield $M$, which we define by:

$$M = [R; A; R + A; R \circ A] \in \mathbb{R}^{n \times 4d} \tag{6}$$

where $\circ$ is elementwise multiplication (Hadamard product) and $[;]$ is concatenation across the last dimension.
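The feature construction in Equation (6) can be illustrated with plain nested lists. This is a minimal sketch with hypothetical toy values; `fuse_features` is an illustrative name, and a real implementation would operate on batched tensors.

```python
def fuse_features(R, A):
    """Build M = [R; A; R + A; R * A] row by row, as in Eq. (6)."""
    M = []
    for r_row, a_row in zip(R, A):
        added = [r + a for r, a in zip(r_row, a_row)]
        multiplied = [r * a for r, a in zip(r_row, a_row)]  # Hadamard product
        M.append(list(r_row) + list(a_row) + added + multiplied)
    return M

# Toy example: n = 2 tokens, d = 2, so each row of M has 4d = 8 entries.
R = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, 0.5], [2.0, 1.0]]
M = fuse_features(R, A)
```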

Next, a two-layer bidirectional GRU, with an output size of $d$ for each direction, is used to fully fuse the information contained in the preliminary representation with the additional useful information in the complementary representation. We concatenate the outputs of the GRU in the two directions, and hence obtain the final contextualized representation $F$ of the input text:

$$F = \text{Bi-GRU}(M) \in \mathbb{R}^{n \times 2d} \tag{7}$$

The use of GRUs enables the complete interaction between the two different kinds of information mentioned before. Therefore, F is expected to be a refined universal representation of input text.

3.4 Output layer

The output layer is task-specific, which means we can adapt HIRE and the fusion layer to other downstream tasks, such as question answering, by changing only the output layer.

The GLUE benchmark contains two types of tasks: classification and regression. For classification tasks, given the input text’s contextualized representation $F$, following Devlin et al. (2019), we take the first row $C \in \mathbb{R}^{2d}$ of $F$, corresponding to the first input token ([CLS]), as the aggregate representation. Let $m$ be the number of labels in the dataset. We pass $C$ through a feed-forward network (FFN):

$$Q = W_2^T \tanh(W_1^T C + b_1) + b_2 \in \mathbb{R}^{m} \tag{8}$$

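with $W_1 \in \mathbb{R}^{2d \times d}$, $W_2 \in \mathbb{R}^{d \times m}$, $b_1 \in \mathbb{R}^{d}$ and $b_2 \in \mathbb{R}^{m}$ the only parameters that we introduce in the output layer. Finally, the probability distribution over the predicted labels is computed as: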

$$p = \mathrm{softmax}(Q) \in \mathbb{R}^{m} \tag{9}$$

For regression tasks, we obtain $Q$ in the same manner with $m = 1$ and take $Q$ as the predicted value.
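As a sanity check on the shapes in Equations (8) and (9), the output layer can be sketched in pure Python. The weights below are hypothetical toy values chosen so the result is easy to verify by hand, and the function names are illustrative.

```python
import math

def output_layer(C, W1, b1, W2, b2):
    """Eq. (8): Q = W2^T tanh(W1^T C + b1) + b2.

    W1 has shape 2d x d, W2 has shape d x m, and C has 2d entries.
    """
    d, m = len(b1), len(b2)
    hidden = [
        math.tanh(sum(W1[k][j] * C[k] for k in range(len(C))) + b1[j])
        for j in range(d)
    ]
    return [sum(W2[j][c] * hidden[j] for j in range(d)) + b2[c] for c in range(m)]

def softmax(Q):
    """Eq. (9): probability distribution over the m labels."""
    peak = max(Q)
    exps = [math.exp(q - peak) for q in Q]
    total = sum(exps)
    return [e / total for e in exps]

# Toy classification setting: d = 1 (so C has 2d = 2 entries) and m = 2 labels.
C = [1.0, 2.0]
W1 = [[1.0], [0.0]]   # 2d x d
b1 = [0.0]
W2 = [[1.0, -1.0]]    # d x m
b2 = [0.0, 0.0]
Q = output_layer(C, W1, b1, W2, b2)
p = softmax(Q)

# Regression reuses the same layer with m = 1; Q_reg[0] is the predicted value.
Q_reg = output_layer(C, W1, b1, [[2.0]], [0.0])
```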

3.5 Training

For classification tasks, the training loss to be minimized is defined by the Cross-Entropy:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{m} y_{i,c} \log(p_{i,c}) \tag{10}$$

where $\theta$ is the set of all parameters in the model, $N$ is the number of examples in the dataset, $p_{i,c}$ is the predicted probability of class $c$ for example $i$ and $y_{i,c}$ is the binary indicator defined as below:

$$y_{i,c} = \begin{cases} 1 & \text{if label } c \text{ is the correct classification for example } i \\ 0 & \text{otherwise} \end{cases}$$

For regression tasks, we define the training loss by mean squared error (MSE):

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} (Q_i - y_i)^2 \tag{11}$$

where $Q_i$ is the predicted value for example $i$, $y_i$ is the ground truth value for example $i$, and $N$ and $\theta$ are the same as in Equation 10.
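Both training losses are short enough to write out directly. In this sketch the one-hot indicator $y_{i,c}$ is represented by the index of the correct class, so the inner sum of the cross-entropy reduces to the log-probability of that class. The function names and sample values are illustrative assumptions.

```python
import math

def cross_entropy(probs, labels):
    """Cross-entropy averaged over examples.

    probs[i][c] is the predicted probability of class c for example i;
    labels[i] is the index of the correct class for example i.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def mean_squared_error(preds, targets):
    """Mean squared error between predicted and ground-truth values."""
    return sum((q - y) ** 2 for q, y in zip(preds, targets)) / len(targets)

# Hypothetical predictions for two examples.
ce = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
mse = mean_squared_error([1.0, 3.0], [0.0, 1.0])
```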

4 Experiments

4.1 Dataset

We conducted the experiments on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) to evaluate our method’s performance. GLUE is a collection of 9 diverse datasets for training, evaluating, and analyzing natural language understanding models. According to the original paper, the GLUE benchmark groups its tasks into three categories:

Single-sentence tasks: The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2018) requires the model to determine whether a sentence is grammatically acceptable; the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) requires the model to predict the sentiment of movie reviews, labeled as positive or negative.

Similarity and paraphrase tasks: Similarity and paraphrase tasks require predicting whether each pair of sentences captures a paraphrase/semantic equivalence relationship. The Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), the Quora Question Pairs (QQP) (Shankar et al., 2016) and the Semantic Textual Similarity Benchmark (STS-B) (Cer et al., 2017) are presented in this category.

Natural Language Inference (NLI) tasks: Natural language inference is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”. The GLUE benchmark contains the following tasks: the Multi-Genre Natural Language Inference Corpus (MNLI) (Williams et al., 2018), Question NLI (QNLI), derived from the Stanford Question Answering Dataset (Rajpurkar et al., 2016), the Recognizing Textual Entailment (RTE) (Bentivogli et al., 2009) and the Winograd Schema Challenge (WNLI) (Levesque et al., 2012).

Four official metrics are adopted to evaluate the model performance: Matthews correlation (Matthews, 1975), accuracy, F1 score, and Pearson and Spearman correlation coefficients. More details will be presented in Section 4.3.
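For readers unfamiliar with the less common metrics, here is a minimal sketch of the Matthews correlation for binary labels and the Pearson correlation coefficient, using their standard definitions. The function names and the zero-denominator convention are assumptions made for this illustration; the official GLUE scorer may differ in such edge cases.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0.0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical labels and predictions.
mcc = matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0])
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```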

| Model | CoLA (Mcc) | SST-2 (Acc) | MRPC (Acc) | QQP (Acc) | STS-B (Pearson) | MNLI-m/mm (Acc) | QNLI (Acc) | RTE (Acc) |
|---|---|---|---|---|---|---|---|---|
| Task category | Single Sentence | Single Sentence | Similarity and Paraphrase | Similarity and Paraphrase | Similarity and Paraphrase | Natural Language Inference | Natural Language Inference | Natural Language Inference |
| MT-DNN | 63.5 | 94.3 | 87.5 | 91.9 | 90.7 | 87.1/86.7 | 92.9 | 83.4 |
| XLNet-LARGE | 63.6 | 95.6 | 89.2 | 91.8 | 91.8 | 89.8/- | 93.9 | 83.8 |
| ALBERT (1.5M) | 71.4 | 96.9 | 90.9 | 92.2 | 93.0 | 90.8/- | 95.3 | 89.2 |
| RoBERTa (Baseline) | 68.0 | 96.4 | 90.9 | 92.2 | 92.4 | 90.2/90.2 | 94.7 | 86.6 |
| RTRHI | 69.6 | 96.8 | 90.9 | 92.0 | 92.2 | 90.6/90.4 | 95.0 | 86.6 |

Table 1: GLUE Dev results. RTRHI results are based on a single model trained on a single task; for each task, the median over five runs with different random seeds but the same hyperparameters is reported. The results of MT-DNN, XLNet-LARGE, ALBERT (1.5M) and RoBERTa are from Liu et al. (2019a), Yang et al. (2019), Lan et al. (2019) and Liu et al. (2019b), respectively. See the lower-most row for the performance of our approach.

4.2 Implementation

Our implementation of RTRHI is based on the PyTorch implementation of Transformer (https://github.com/huggingface/transformers).

Preprocessing: Following Liu et al. (2019b), we adopt the GPT-2 (Radford et al., 2019) tokenizer with a Byte-Pair Encoding (BPE) vocabulary of 50K subword units. We format the input sequence in the following way. Given a single sequence X, we add a <s> token at the beginning and a </s> token at the end: <s>X</s>. For a pair of sequences (X,Y), we additionally use </s> to separate the two sequences: <s>X</s>Y</s>.
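The formatting rule just described can be written as a small helper. This sketch works on raw strings purely for illustration; `format_input` is a hypothetical name, and a real pipeline would apply the rule to token ids produced by the tokenizer.

```python
def format_input(x, y=None, bos="<s>", eos="</s>"):
    """Wrap a single sequence or a sequence pair with the special tokens
    in the layout described above."""
    if y is None:
        return f"{bos}{x}{eos}"
    return f"{bos}{x}{eos}{y}{eos}"

single = format_input("X")
pair = format_input("X", "Y")
```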

Model configurations: We use RoBERTa-Large as the Transformer-based encoder and load the pre-training weights of RoBERTa (Liu et al., 2019b). Like BERT-Large, RoBERTa-Large model contains 24 Transformer-blocks, with the hidden size being 1024 and the number of self-attention heads being 16 (Liu et al., 2019b; Devlin et al., 2019).
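Plugging this configuration into the shapes defined in Section 3 gives a quick consistency check. The helper below only does shape bookkeeping; its name and the sequence length `n = 128` are hypothetical choices for this example, not values from the experiments.

```python
def rtrhi_shapes(n, d, num_layers):
    """Tensor shapes from Section 3 for a sequence of n tokens."""
    return {
        "hidden_states": (num_layers + 1, n, d),  # H_0 ... H_l
        "U": (num_layers + 1, 4 * d),             # Bi-GRU summaries, Eq. (2)
        "S": (num_layers + 1,),                   # contribution scores, Eq. (4)
        "A": (n, d),                              # complementary representation, Eq. (5)
        "M": (n, 4 * d),                          # fused features, Eq. (6)
        "F": (n, 2 * d),                          # refined representation, Eq. (7)
        "C": (2 * d,),                            # first row of F
    }

# RoBERTa-Large: 24 Transformer blocks, hidden size 1024.
shapes = rtrhi_shapes(n=128, d=1024, num_layers=24)
```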

Optimization: We use the Adam optimizer (Kingma and Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-6}$. The learning rate is selected amongst {5e-6, 1e-5, 2e-5, 3e-5}, with a warmup rate ranging from 0.06 to 0.25 depending on the nature of the task. The number of training epochs ranges from 4 to 10 with early stopping, and the batch size is selected amongst {16, 32, 48}. In addition, we clip the gradient norm to 1 to prevent the exploding gradient problem that can occur in the recurrent neural networks in our model.
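The warmup rate and gradient clipping can be illustrated as follows. The text does not state the shape of the schedule, so this sketch assumes linear warmup followed by linear decay to zero, a common choice when fine-tuning RoBERTa. Both function names are hypothetical, and the gradient is flattened into a plain list for simplicity.

```python
import math

def learning_rate(step, total_steps, peak_lr, warmup_rate):
    """Assumed schedule: linear warmup over warmup_rate * total_steps steps,
    then linear decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_rate))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# Example: 100 training steps, peak learning rate 2e-5, warmup rate 0.25.
lr_start = learning_rate(0, 100, 2e-5, 0.25)
lr_peak = learning_rate(25, 100, 2e-5, 0.25)
lr_end = learning_rate(100, 100, 2e-5, 0.25)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```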

Regularization: We employ two types of regularization methods during training. We apply dropout (Srivastava et al., 2014) of rate 0.1 to all layers in the Transformer-based encoder and GRUs in the HIRE and fusion layer. We additionally adopt L2 weight decay of 0.1 during training.

4.3 Main results

Table 1 compares our method RTRHI with a list of Transformer-based models on the development set. To obtain a direct and fair comparison with our baseline model RoBERTa, following the original paper (Liu et al., 2019b), we fine-tune RTRHI separately for each of the GLUE tasks, using only task-specific training data. The single-model results for each task are reported. We run our model with five different random seeds but the same hyperparameters and take the median value. Due to the problematic nature of the WNLI dataset, we exclude its results from this table. The results show that RTRHI outperforms RoBERTa on 4 of the GLUE task development sets, with improvements of 1.6 points, 0.4 points, 0.4/0.2 points and 0.3 points on CoLA, SST-2, MNLI and QNLI respectively. On the MRPC and RTE tasks, our model obtains the same results as RoBERTa. It should be noted that the improvement is entirely attributed to the introduction of the HIdden Representation Extractor and Fusion Layer in our model.
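The reporting protocol (median over five seeds with fixed hyperparameters) amounts to the following. The seeds and scores are made-up placeholders rather than results from the experiments, and `report_median` is an illustrative name.

```python
import statistics

def report_median(scores_by_seed):
    """Return the median development-set score across random seeds."""
    return statistics.median(scores_by_seed.values())

# Hypothetical scores for five seeds (not taken from the paper).
runs = {11: 80.1, 22: 79.5, 33: 81.0, 44: 80.4, 55: 79.9}
median_score = report_median(runs)
```

With an odd number of runs, the median is one of the observed scores, so the reported number always corresponds to an actual trained model.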

| Model | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | WNLI | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Task category | Single Sentence | Single Sentence | Similarity and Paraphrase | Similarity and Paraphrase | Similarity and Paraphrase | Natural Language Inference | Natural Language Inference | Natural Language Inference | Natural Language Inference | |
| Training size | 8.5k | 67k | 3.7k | 364k | 7k | 393k | 108k | 2.5k | 634 | |
| XLNet | 67.8 | 96.8 | 93.0/90.7 | 74.2/90.3 | 91.6/91.1 | 90.2/89.7 | 98.6 | 86.3 | 90.4 | 88.4 |
| MT-DNN | 68.4 | 96.5 | 92.7/90.3 | 73.7/89.9 | 91.1/90.7 | 87.9/87.4 | 96.0 | 86.3 | 89.0 | 87.6 |
| FreeLB-RoBERTa | 68.0 | 96.8 | 93.1/90.8 | 74.8/90.3 | 92.4/92.2 | 91.1/90.7 | 98.8 | 88.7 | 89.0 | 88.8 |
| ALICE v2 | 69.2 | 97.1 | 93.6/91.5 | 74.4/90.7 | 92.7/92.3 | 90.7/90.2 | 99.2 | 87.3 | 89.7 | 89.0 |
| ALBERT | 69.1 | 97.1 | 93.4/91.2 | 74.2/90.5 | 92.5/92.0 | 91.3/90.0 | 99.2 | 89.2 | 91.8 | 89.4 |
| T5 | 70.8 | 97.1 | 91.9/89.2 | 74.6/90.4 | 92.5/92.1 | 92.0/91.7 | 96.7 | 92.5 | 93.2 | 89.7 |
| RoBERTa (Baseline) | 67.8 | 96.7 | 92.3/89.8 | 74.3/90.2 | 92.2/91.9 | 90.8/90.2 | 98.9 | 88.2 | 89.0 | 88.5 |
| RTRHI | 68.6 | 97.1 | 93.0/90.7 | 74.3/90.2 | 92.4/92.0 | 90.7/90.4 | 95.5 | 87.9 | 89.0 | 88.3 |

Table 2: GLUE Test results, scored by the official evaluation server. All the results are obtained from the GLUE leaderboard (https://gluebenchmark.com/leaderboard) at the time of submitting RTRHI (3 November, 2019). The training size row indicates the size of each task’s training dataset. The state-of-the-art results are in bold. RTRHI takes RoBERTa as its Transformer-based encoder. Mcc, acc and pearson denote Matthews correlation, accuracy and Pearson correlation coefficient respectively.

Table 2 presents the results of RTRHI and other models on the test set that have been submitted to the GLUE leaderboard. Following Liu et al. (2019b), we fine-tune STS-B and MRPC starting from the MNLI single-task model. Given the similarity between RTE, WNLI and MNLI, and the large-scale nature of the MNLI dataset (393k), we also initialize RTRHI with the weights of the MNLI single-task model before fine-tuning on RTE and WNLI. We submitted the ensemble-model results to the leaderboard. The results show that RTRHI still improves on the strong RoBERTa baseline model on the test set. To be specific, RTRHI outperforms RoBERTa on CoLA, SST-2, MRPC, STS-B and MNLI-mm, with improvements of 0.8 points, 0.4 points, 0.7/0.9 points, 0.2/0.1 points and 0.2 points respectively. In the meantime, RTRHI gets the same results as RoBERTa on QQP and WNLI. By category, RTRHI performs better than RoBERTa on the single-sentence tasks and on the similarity and paraphrase tasks. It’s worth noting that our model obtains state-of-the-art results on the SST-2 dataset, with a score of 97.1. The results are quite promising since HIRE does not modify the encoder’s internal architecture (Yang et al., 2019) or redefine the pre-training procedure (Liu et al., 2019b), and we still obtain results comparable to theirs.

5 Analysis

Figure 3: Distribution of contribution scores over different layers when computing the complementary representation for various NLU tasks. The contribution scores are normalized by SoftMax to sum up to 1 for each row. The numbers on the abscissa axis indicate the corresponding layer with 0 being the first layer and 23 being the last layer.

We compare the distribution of contribution scores across different NLU tasks. For each task, we run our best single model over the development set, and the results are calculated by averaging the values across all the examples within each dataset. The results are shown in Figure 3. From the top to the bottom of the heatmap, the results are placed in the following order: single-sentence tasks, similarity and paraphrase tasks and natural language inference tasks. From Figure 3, we find that the distribution differs among the different tasks, which demonstrates RTRHI’s ability to adapt dynamically to distinct tasks when computing the complementary representation. The most important contribution occurs below the final layer for all the tasks except MRPC and RTE. All layers contribute almost equally for the MRPC and RTE tasks.
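The per-task rows of the heatmap are averages of per-example contribution scores. A minimal sketch of that averaging step, with hypothetical scores for two examples and three hidden-states:

```python
def average_scores(per_example_scores):
    """Column-wise mean of per-example contribution score vectors."""
    count = len(per_example_scores)
    width = len(per_example_scores[0])
    return [sum(row[j] for row in per_example_scores) / count for j in range(width)]

# Each row is the softmax output S for one example (hypothetical values).
scores = [[0.1, 0.2, 0.7], [0.3, 0.2, 0.5]]
avg = average_scores(scores)
```

Since every per-example vector sums to 1, the averaged vector also sums to 1, which matches the row normalization described in the caption of Figure 3.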

Figure 4 presents the distribution of contribution scores over different layers for each example of the SST-2 dataset. The number on the ordinate axis denotes the index of the example. We observe that even though there are subtle differences among these examples, they follow certain common patterns when calculating the complementary representation. For example, layers 21 and 22, along with the layers around them, contribute the most for almost all the examples. However, the figure also shows that for some examples, all layers contribute almost equally.

Figure 4: Distribution of contribution scores over different layers for each example of SST-2 dataset. The number on the ordinate axis denotes the index of the example.

6 Conclusion

In this paper, we have introduced RTRHI, a novel approach that refines language representation by leveraging the Transformer-based model’s hidden layers. Specifically, a HIdden Representation Extractor is used to dynamically generate complementary information, which is then combined with the preliminary representation in the Fusion Layer. The experimental results demonstrate the effectiveness of the refined language representation for natural language understanding. The analysis highlights the distinct contribution of each layer’s output for different tasks and different examples. We expect future work could be conducted in the following directions: (1) exploring a sparse version of the Hidden Representation Extractor for more efficient computation and less memory usage; (2) incorporating extra knowledge information (Zhang et al., 2019a) or structured semantic information (Zhang et al., 2019b) with the current language representation in the fusion layer during fine-tuning; (3) integrating multi-task training (Caruana, 1997) or knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015) into our model.

References

  • Bentivogli et al. (2009) Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth pascal recognizing textual entailment challenge. In TAC.
  • Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  • Buciluǎ et al. (2006) Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541. ACM.
  • Caruana (1997) Rich Caruana. 1997. Multitask learning. Machine learning, 28(1):41–75.
  • Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Dolan and Brockett (2005) William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
  • Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  • Levesque et al. (2012) Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
  • Liu et al. (2019a) Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496, Florence, Italy. Association for Computational Linguistics.
  • Liu et al. (2019b) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Matthews (1975) Brian W Matthews. 1975. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442–451.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium. Association for Computational Linguistics.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, and Ilya Sutskever. 2019. Better language models and their implications. Technical report, OpenAI.
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Shankar et al. (2016) Iyer Shankar, Dandekar Nikhil, and Csernai Kornl. 2016. First quora dataset release: Question pairs.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.
  • Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. EMNLP 2018, page 353.
  • Warstadt et al. (2018) Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Zhang et al. (2019a) Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019a. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.
  • Zhang et al. (2019b) Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang Zhou. 2019b. Semantics-aware bert for language understanding. arXiv preprint arXiv:1909.02209.
  • Zhu et al. (2019) Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Thomas Goldstein, and Jingjing Liu. 2019. Freelb: Enhanced adversarial training for language understanding. arXiv preprint arXiv:1909.11764.
  • Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.