Unsupervised Domain Adaptation on Reading Comprehension
Abstract
Reading comprehension (RC) has been studied on a variety of datasets, with performance boosted by deep neural networks. However, the generalization capability of these models across different domains remains unclear. To address this issue, we investigate unsupervised domain adaptation on RC, wherein a model is trained on a labeled source domain and applied to a target domain with only unlabeled samples. We first show that even with the powerful BERT contextual representation, performance is still unsatisfactory when a model trained on one dataset is directly applied to another target dataset. To solve this, we provide a novel conditional adversarial self-training method (CASe). Specifically, our approach leverages a BERT model fine-tuned on the source dataset along with confidence filtering to generate reliable pseudo-labeled samples in the target domain for self-training. In addition, it further reduces the domain distribution discrepancy through conditional adversarial learning across domains. Extensive experiments show that our approach achieves accuracy comparable to supervised models on multiple large-scale benchmark datasets.
Yu Cao^{1}, Meng Fang^{2†}, Baosheng Yu^{1}, Joey Tianyi Zhou^{3}
^{1} UBTECH Sydney AI Center, School of Computer Science, FEIT, The University of Sydney, Australia
^{2} Tencent AI Lab
^{3} Institute of High Performance Computing, A*STAR, Singapore
^{†} Corresponding author: Meng Fang.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
Reading comprehension (RC) is a widely studied topic in Natural Language Processing (NLP) due to its value in human-machine interaction. Past research has proposed a variety of large-scale RC datasets, e.g., CNN/DailyMail (?), SQuAD (?), NewsQA (?), CoQA (?) and DROP (?). With their large number of annotations, these datasets make it possible to train end-to-end deep neural models (?; ?). More recent studies showed that the BERT (?) model achieves higher answer accuracy than humans on SQuAD.
However, only unlabeled data is available in many real-world applications. A common challenge is for a machine to learn well enough in one domain to answer questions in other domains without any labels. Unfortunately, the generalization capabilities of some existing neural RC models have been shown to be weak across different datasets (?). In fact, our experiments lead to the same conclusion for BERT, e.g., performance drops on the CNN dataset when using a model trained on SQuAD. Therefore, eliminating such performance gaps between datasets deserves effort.
One potential direction is to transfer knowledge from a labeled source domain to a different unlabeled target domain, leveraging data from both domains. This is known as unsupervised domain adaptation (?). However, only a few works have attempted unsupervised domain adaptation on RC tasks. Although ? adapted models using vanilla self-training, its self-labeling approach cannot ensure labeling accuracy on a target dataset that differs greatly from the source one. Besides, it was only applied to some small RC datasets, so its effectiveness on large-scale datasets remains unclear and no general representation is learned. Research on large datasets is more meaningful, since they contain more diverse patterns than small ones. They pose a greater challenge, better fit realistic conditions, and are the basis for building strong deep neural models. In addition, analyzing the factors that influence transfer is also necessary, as it provides guidance for adaptation. Nevertheless, very few works contribute to it (?).
In this paper, to make use of the numerous unlabeled samples in real applications, we focus on unsupervised domain adaptation on large RC datasets. We propose a novel adaptation method named Conditional Adversarial Self-training (CASe). A fine-tuned BERT model is first obtained on the source domain. Then, in the adaptation stage, an alternating training strategy is applied, consisting of self-training and conditional adversarial learning in each epoch. The pseudo-labeled samples of the target dataset, generated by the most recent model and passed through low-confidence filtering, are used for self-training. Compared to the method in (?), the filtering prevents the model from learning an erroneous target domain distribution, especially on large datasets. Conditional adversarial learning, whose discriminator input combines BERT features and final output logits, is used because the conditioning provides more comprehensive information than features alone. It encourages the model to learn generalized representations and avoid overfitting to the pseudo-labeled data.
Moreover, we test the generalization of BERT among 6 large RC datasets to demonstrate the importance of adaptation, since it fails under most conditions. The factors that cause the failure are also illustrated via analysis.
We validate the proposed method on different pairs of these 6 datasets and establish baseline performance.
Our contributions can be summarized as:

• We propose a new unsupervised domain adaptation method for RC, which alternates between two stages: self-training with low-confidence filtering and conditional adversarial learning.

• We experimentally evaluate the method on 6 popular datasets, where it shows performance comparable to models trained on the target datasets. It can be regarded as a pioneering study and a baseline for future work. Code is available at https://github.com/caoyu1991/CASe.

• We show that the transferability among different datasets depends not only on the corpus but is also significantly affected by question forms.
Related Work
Numerous models have been proposed for RC tasks. R-NET integrates mutual attention and self-attention into an RNN encoder to refine the representation (?). QANet (?) leverages similar attention in a stacked convolutional encoder to improve performance. BERT (?) stacks multiple transformers (?). By applying unsupervised pre-training tasks and then fine-tuning on a specific dataset, it achieves state-of-the-art performance on various NLP tasks including RC. However, none of these works explores model generalizability across different datasets, and their transferability remains unknown.
Prior work on domain adaptation has been done for several NLP tasks. Some works apply instance weighting to statistical machine translation (SMT) (?) or cross-language text classification (?). A cross-entropy based method is used to select out-of-domain sentences for training SMT (?). There are also attempts for RC, showing that the performance of RC models on small datasets can be improved by supervised transfer from a large dataset (?; ?) using annotations from both domains. MultiQA (?) strengthens the generalizability of RC models by training on samples from various datasets. Though some studies concentrate on the generalization of RC models and analyze their performance on multiple datasets (?; ?), they do not analyze the influential factors in detail. A parallel work on unsupervised domain adaptation for RC (?) uses simple self-labeling for retraining and is evaluated on 3 small datasets containing thousands of samples.
Many relevant works focus on unsupervised domain adaptation for general CV tasks. Co-training (?) uses two classifiers and two data views to generate labels for unlabeled samples. Both tri-training (?) and asymmetric tri-training (?) extend co-training by using three classifiers to generate labels, i.e., a label is added if two classifiers agree. Some approaches try to learn domain-invariant representations by selecting similar instances between domains or adding a classifier to distinguish domains (?; ?). ADDA (?) leverages the Generative Adversarial Network (GAN) loss on domain labels to train a new network. CDAN (?) applies conditional adversarial learning, which combines features and labels using a multilinear mapping.
Our work is part of the research on unsupervised domain adaptation as well as generalization analysis, with an emphasis on large-scale reading comprehension datasets.
Problem Definition
We first describe a standard text-span-based RC task such as SQuAD (?). Given a supporting paragraph $\mathcal{P}=\langle {p}_{1},{p}_{2},\dots,{p}_{M}\rangle$ with $M$ tokens and a query $\mathcal{Q}=\langle {q}_{1},{q}_{2},\dots,{q}_{L}\rangle$ with $L$ tokens, the answer $\mathcal{A}=\langle {p}_{{a}^{s}},{p}_{{a}^{s}+1},\dots,{p}_{{a}^{e}}\rangle$ is a text span in the original paragraph. The task is to find the correct answer span $({a}^{s},{a}^{e}),0\le {a}^{s}\le {a}^{e}\le M$. That is, a model needs to predict two values: the start index and the end index of the answer span.
The unsupervised domain adaptation task for RC is then formally defined as follows. There is a source domain with labeled data and a target domain with unlabeled data. We have $n$ labeled samples ${\{({x}_{i},{y}_{i})\}}_{i=1}^{n}$ in the source domain, in which text ${x}_{i}=({\mathcal{P}}_{i},{\mathcal{Q}}_{i})$ and label ${y}_{i}=({a}_{i}^{s},{a}_{i}^{e})$, and ${n}^{\prime}$ unlabeled target domain samples ${\{({x}_{j}^{\prime})\}}_{j=1}^{{n}^{\prime}}$, sharing the same standard RC task as described above. We assume that the data in the source domain is sampled from distribution $\mathcal{D}(x,y)$ and the data in the target domain is sampled from distribution ${\mathcal{D}}^{\prime}({x}^{\prime},{y}^{\prime})$, $\mathcal{D}\ne {\mathcal{D}}^{\prime}$. Our goal is to find a deep neural model that reduces the distribution shift and achieves optimal performance on the target domain.
Domain Adaptation Method
The main purpose of our approach is to transfer a model trained on labeled data in the source domain to the unlabeled target domain. Generally, a model with good generalization can reduce the discrepancy of intermediate states generated from different distributions (?). We use the BERT model (?), a contextual model pre-trained on unsupervised NLP tasks with a huge 3.3-billion-word corpus. Its depth and huge training data size ensure that it can generate universal feature representations under a variety of linguistic conditions. We also apply adversarial learning to minimize the cross-domain discrepancy between $\mathcal{D}(x,y)$ and ${\mathcal{D}}^{\prime}({x}^{\prime},{y}^{\prime})$ (?). Moreover, pseudo-label based self-training (?) with low-confidence filtering is used to further leverage unlabeled data in the target domain.
The framework of the proposed Conditional Adversarial Self-training (CASe) approach for unsupervised domain adaptation on RC is illustrated in Figure 1. Our model has three components: a BERT feature network, an output network, and a discriminator network. There are 3 steps in CASe. First, we fine-tune the BERT feature network and the output network on the source domain. Second, we use self-training on the target domain to obtain a distribution-shifted model. Third, we apply conditional adversarial learning on both domains to further reduce feature distribution divergence. The second and third steps are repeated iteratively.
Training on the Source Domain
Since we have labeled data in the source domain, we extend the unsupervised pre-trained base BERT model and fine-tune it on these samples. The BERT feature $\overline{\mathbf{f}}\in {\mathbb{R}}^{m\times d}$ is first obtained, where $m$ and $d$ are the maximum input sequence length and the hidden state dimension in BERT, respectively. Then a single-layer linear output network with a 2-dimensional output vector is added after BERT. One of its output values is used as the answer start logits ${\mathbf{g}}^{s}\in {\mathbb{R}}^{m}$ and the other as the answer end logits ${\mathbf{g}}^{e}\in {\mathbb{R}}^{m}$. Finally, the supervised pre-trained BERT model and output network are obtained by optimizing the following loss function:
$$\mathcal{L}=\frac{1}{2}\left({f}_{CE}({\mathbf{g}}^{s},{a}^{s})+{f}_{CE}({\mathbf{g}}^{e},{a}^{e})\right),$$  (1) 
where ${f}_{CE}$ is the cross entropy loss function, ${a}^{s}$ and ${a}^{e}$ are labels for the answer start and end indices, respectively.
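As a minimal illustration of Eq. (1) for a single sample, the sketch below averages the cross entropy of the start and end logits against the gold indices. It uses plain Python lists in place of tensors, and the function names `cross_entropy` and `span_loss` are hypothetical, chosen only for this example.

```python
import math

def cross_entropy(logits, target):
    """Cross entropy of softmax(logits) against one gold index."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]

def span_loss(start_logits, end_logits, start_idx, end_idx):
    """Eq. (1): mean of the start-index and end-index cross entropies."""
    return 0.5 * (cross_entropy(start_logits, start_idx)
                  + cross_entropy(end_logits, end_idx))

# Toy sequence of 4 tokens whose gold answer span is tokens 1..2.
loss = span_loss([0.1, 2.0, 0.3, 0.0], [0.0, 0.2, 1.5, 0.1], 1, 2)
```

With uninformative (all-equal) logits over $m$ tokens the loss equals $\log m$, which gives a convenient reference point when monitoring training.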
To further enhance the regularization of BERT, we add a batch normalization layer (?) between the BERT feature $\overline{\mathbf{f}}\in {\mathbb{R}}^{m\times d}$ and the output network.
Selftraining on the Target Domain
After obtaining the pre-trained model from the source domain, we use it to predict sample labels in the target domain. Although the data distribution may differ between domains, we can still assume that different domains share some similar characteristics. That is, some predicted answers will be similar to or the same as the correct answer spans even in a new domain. These predictions, combined with the corresponding samples ${x}^{\prime}=(\mathcal{P},\mathcal{Q})$ in the target domain, are called pseudo-labeled samples and can be used to teach the model about the new distribution.
Similar to the method in asymmetric tri-training (?), to avoid significant error propagation, we select predictions of high confidence as pseudo labels. Since our model generates probabilities for every predicted answer start and end index, a threshold ${T}_{prob}$ is employed to filter out low-confidence samples.
Normally, we would apply a softmax function to all output logits and regard the resulting values as the probabilities of indices being the answer start or end index. However, the passage is usually very long in RC tasks, leading to a very small probability value for each index. This reduces the numerical distinctions between probabilities and brings more noise, which harms the effectiveness of threshold-based filtering. We thus first select a set $\mathcal{U}$ of ${n}_{best}$ start and end index pairs. These pairs have the top-${n}_{best}$ sums of start index logits ${g}_{i}^{s}$ and end index logits ${g}_{j}^{e},0\le i\le j\le M$ among the candidate answer spans of a target domain sample, i.e.,
$$\mathcal{U}=\{{(i,j)}_{1},\mathrm{\dots},{(i,j)}_{{n}_{best}}\}=\underset{(i,j)}{\mathrm{arg}\underset{{n}_{best}}{\mathrm{max}}}({g}_{i}^{s}+{g}_{j}^{e}).$$  (2) 
A softmax function is then applied to these ${n}_{best}$ sums. The span with the highest value after softmax is regarded as the predicted span, and its value is defined as the generating probability ${p}^{g}$ for the current sample, i.e.,
$${p}^{g}=\mathrm{max}(\mathrm{softmax}(\{{g}_{i}^{s}+{g}_{j}^{e}\})),(i,j)\in \mathcal{U}.$$  (3) 
Samples with ${p}^{g}\ge {T}_{prob}$ are put into the pseudo-labeled sample set, using the predicted start and end indices as their labels, $\widehat{a^{s}{}^{\prime}}$ and $\widehat{a^{e}{}^{\prime}}$. The model is trained with the same loss as in (1), but ${a}^{s}$ and ${a}^{e}$ are replaced by $\widehat{a^{s}{}^{\prime}}$ and $\widehat{a^{e}{}^{\prime}}$, respectively.
In each epoch during adaptation, pseudo-labeled samples are always generated by the most recent model and the previous ones are discarded, while ${T}_{prob}$ stays the same.
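The filtering in Eqs. (2) and (3) can be sketched as follows for one target-domain sample. This is an illustrative standard-library version, not the released implementation: it enumerates all valid spans exhaustively, which is quadratic in the sequence length, and the names `generating_probability` and `build_pseudo_labeled_set` are assumptions made for this example.

```python
import heapq
import math

def generating_probability(start_logits, end_logits, n_best=20):
    """Return the predicted span (i, j) and its generating probability p_g."""
    m = len(start_logits)
    # Eq. (2): score every valid span (i <= j) by the sum of its logits
    # and keep the n_best highest-scoring spans, sorted in descending order.
    candidates = ((start_logits[i] + end_logits[j], (i, j))
                  for i in range(m) for j in range(i, m))
    top = heapq.nlargest(n_best, candidates)
    # Eq. (3): softmax over the n_best sums only; p_g is the largest value.
    peak = top[0][0]
    weights = [math.exp(score - peak) for score, _ in top]
    return top[0][1], weights[0] / sum(weights)

def build_pseudo_labeled_set(samples, t_prob=0.4, n_best=20):
    """Keep (sample_id, span) only when p_g >= t_prob."""
    kept = []
    for sample_id, start_logits, end_logits in samples:
        span, p_g = generating_probability(start_logits, end_logits, n_best)
        if p_g >= t_prob:
            kept.append((sample_id, span))
    return kept

samples = [
    ("confident", [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]),  # clear peak at span (1, 2)
    ("uncertain", [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]),  # all spans equally likely
]
pseudo = build_pseudo_labeled_set(samples)
```

Because the softmax is taken over at most ${n}_{best}$ candidates rather than over all positions, a sample with flat logits still receives a low ${p}^{g}$ (at most one over the number of candidates) while a sharply peaked sample receives a value close to 1, which is what makes a fixed threshold usable.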
Conditional Adversarial Learning
Adversarial learning leverages a discriminator to predict domain classes. But most models only use feature representations for prediction (?; ?), which may be insufficient because the joint distribution of features and labels is not identical across domains.
Since our span-based RC tasks can be regarded as a multi-class classification problem and the span properties vary across domains, they pose more challenges for discriminators based only on features. Inspired by the Conditional Adversarial Network (CDAN) (?), we use conditional adversarial learning that fuses the feature $\mathbf{f}$ and the output logits $\mathbf{g}$ into a comprehensive representation. The network architecture is illustrated in Figure 2. Note that $\mathbf{f}\in {\mathbb{R}}^{m\times d}$ is the BERT feature after the batch normalization layer.
One approach to conditioning the discriminator $D$ on $\mathbf{g}$ is to use a multilinear map, which is the outer product $\mathbf{x}\otimes \mathbf{y}$ of two vectors and is superior to concatenation (?). However, it results in dimension explosion: the output dimension is $m\times d\times 2m$ in our application, which is too large to be used as an embedding. Following CDAN, we tackle this with a randomized approach. The multilinear map of two pairs of features and outputs can be approximated by
$$\langle \mathbf{f}\otimes \mathbf{g},{\mathbf{f}}^{\prime}\otimes {\mathbf{g}}^{\prime}\rangle \approx \langle {Z}_{R}(\mathbf{f},\mathbf{g}),{Z}_{R}({\mathbf{f}}^{\prime},{\mathbf{g}}^{\prime})\rangle ,$$  (4) 
where ${Z}_{R}$ is a randomly sampled multilinear map and generates a vector of dimension ${d}_{R}\ll m\times d\times 2m$. Given two randomly initialized matrices fixed during training ${\mathbf{R}}_{\mathbf{f}}\in {\mathbb{R}}^{{d}_{R}\times m}$ and ${\mathbf{R}}_{\mathbf{g}}\in {\mathbb{R}}^{{d}_{R}\times 2m}$, ${Z}_{R}$ can be defined as
$${Z}_{R}(\mathbf{f},\mathbf{g})=\frac{1}{\sqrt{{d}_{R}}}\left({\mathbf{R}}_{\mathbf{f}}\,{\mathrm{avg}}_{\mathrm{col}}(\mathbf{f})\right)\circ \left({\mathbf{R}}_{\mathbf{g}}\mathbf{g}\right).$$  (5) 
Here, $\mathbf{g}={\mathbf{g}}^{s}\oplus {\mathbf{g}}^{e}\in {\mathbb{R}}^{2m}$, ${\mathrm{avg}}_{\mathrm{col}}$ denotes averaging along columns, which transforms the feature matrix into a vector in ${\mathbb{R}}^{m}$, and $\circ $ is element-wise multiplication.
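A small standard-library sketch of Eq. (5) may help make the shapes concrete. Two points are assumptions of this example rather than statements from the text: the fixed random matrices are drawn from a standard Gaussian, and the column average is read as averaging the $d$ hidden values of each token so that the result has length $m$. The helper names are hypothetical.

```python
import math
import random

def random_matrix(rows, cols, rng):
    """Fixed random projection; Gaussian entries are an assumption here."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def avg_col(feature):
    """Reduce an m x d feature matrix to a length-m vector (one mean per token)."""
    return [sum(row) / len(row) for row in feature]

def randomized_multilinear(feature, logits, r_f, r_g):
    """Eq. (5): element-wise product of two random projections, scaled by 1/sqrt(d_R)."""
    d_r = len(r_f)
    proj_f = matvec(r_f, avg_col(feature))  # length d_R
    proj_g = matvec(r_g, logits)            # length d_R
    return [a * b / math.sqrt(d_r) for a, b in zip(proj_f, proj_g)]

rng = random.Random(0)
m, d, d_r = 4, 3, 8                      # toy sizes; the text uses m = 512, d = d_R = 768
r_f = random_matrix(d_r, m, rng)         # R_f in R^{d_R x m}, kept fixed during training
r_g = random_matrix(d_r, 2 * m, rng)     # R_g in R^{d_R x 2m}, kept fixed during training
feature = [[rng.random() for _ in range(d)] for _ in range(m)]
logits = [rng.random() for _ in range(2 * m)]  # start logits followed by end logits
z = randomized_multilinear(feature, logits, r_f, r_g)
```

The discriminator input `z` has only ${d}_{R}$ entries, regardless of how large the explicit outer product would have been.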
The discriminator is a 3-layer linear network whose final layer has a 1-dimensional output with a sigmoid activation, producing a scalar between 0 and 1. We directly adopt ${Z}_{R}(\mathbf{f},\mathbf{g})$ as its input for computational efficiency.
All 3 components, the BERT feature network, the output network, and the discriminator network, are jointly optimized in this stage because the discriminator is conditioned on both features and outputs. The loss function is the binary cross entropy loss
$${\mathcal{L}}_{adv}=-\left({y}^{d}\mathrm{log}({\widehat{y}}^{d})+(1-{y}^{d})\mathrm{log}(1-{\widehat{y}}^{d})\right),$$  (6) 
where ${\widehat{y}}^{d}$ is the value predicted by $D$ for the domain label and ${y}^{d}\in \{0,1\}$ is the ground truth label, with 0 standing for the source domain and 1 for the target domain. Samples $x,{x}^{\prime}$ from both domains are used for joint training.
However, such an optimization gives equal importance to all samples, while samples that are hard to transfer have a negative effect on domain adaptation. To ensure a more effective transfer, we quantify the uncertainty of a sample using the entropy $E(\mathbf{p})=-{\sum}_{i=1}^{M}({p}_{i}^{s}\mathrm{log}{p}_{i}^{s}+{p}_{i}^{e}\mathrm{log}{p}_{i}^{e})$. Here ${p}_{i}^{s}$ and ${p}_{i}^{e}$ are the probabilities of the $i$-th token being the answer start or end index, obtained by applying softmax to the whole output logits ${\mathbf{g}}^{s}$ and ${\mathbf{g}}^{e}$. We encourage the discriminator to place a higher priority on samples that are easy to transfer. In other words, samples with lower entropy have higher weights during conditional adversarial learning (CASe+E). The adversarial loss function can be reformulated using the weight $w$ derived from the entropy, i.e.,
$${\mathcal{L}}_{advE}=w\cdot {\mathcal{L}}_{adv},\quad w=1+{e}^{-E(\mathbf{p})}.$$  (7) 
No matter which loss is employed, the conditional adversarial learning makes the feature model and the output model more transferable and generalizable.
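The entropy weighting can be illustrated with a short standard-library sketch. It assumes the conventional negative-log forms of the binary cross entropy and of the entropy, so that the weight is $w=1+{e}^{-E(\mathbf{p})}$ and decreases as the prediction becomes more uncertain. The function names and the clamping constant `eps` are choices made for this example only.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def span_entropy(start_logits, end_logits):
    """E(p): entropy of the start distribution plus entropy of the end distribution."""
    probs = softmax(start_logits) + softmax(end_logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def domain_bce(pred, domain_label, eps=1e-12):
    """Binary cross entropy for the domain label (0 = source, 1 = target)."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to keep the logarithms finite
    return -(domain_label * math.log(pred)
             + (1 - domain_label) * math.log(1.0 - pred))

def entropy_weight(start_logits, end_logits):
    """w = 1 + exp(-E(p)); confident samples approach 2, uncertain ones approach 1."""
    return 1.0 + math.exp(-span_entropy(start_logits, end_logits))

def weighted_adversarial_loss(pred, domain_label, start_logits, end_logits):
    return entropy_weight(start_logits, end_logits) * domain_bce(pred, domain_label)

w_confident = entropy_weight([0.0, 20.0, 0.0], [0.0, 0.0, 20.0])
w_uncertain = entropy_weight([0.0, 0.0], [0.0, 0.0])
```

For two tokens with uniform start and end distributions the entropy is $2\mathrm{log}2$, so the weight is $1+1/4=1.25$, whereas a sharply peaked prediction yields a weight close to 2.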
Algorithm
The entire procedure of CASe is shown in Algorithm 1. Note that no adversarial learning is included in the last epoch of domain adaptation. This is intended to make the final model better fit the target domain, because adversarial learning enhances generalization but hinders fitting to a specific domain. In step 16, we balance the number of samples from the two domains by randomly removing samples from the larger dataset when merging, to avoid unbalanced training.
Algorithm 1: CASe. Given a BERT feature network $F$, an output network $G$, and a discriminator $D$. The pre-training epoch number is ${N}_{pre}$ and the domain adaptation training epoch number is ${N}_{da}$.
Input: data in the source domain $\mathcal{S}={\{({\mathcal{P}}_{i},{\mathcal{Q}}_{i},{a}_{i}^{s},{a}_{i}^{e})\}}_{i=1}^{n}$, data in the target domain ${\mathcal{S}}^{\prime}={\{({\mathcal{P}}_{i}^{\prime},{\mathcal{Q}}_{i}^{\prime})\}}_{i=1}^{{n}^{\prime}}$.
Output: Optimal model $F$, $G$ in the target domain
1 for j=1 to ${N}_{pre}$ do
2   Train $F$ and $G$ with minibatch from $\mathcal{S}$
3 end for
4 for j=1 to ${N}_{da}$ do
5   Pseudo labeled set ${\mathcal{S}}^{P}=\mathrm{\varnothing}$
6   for k=1 to ${n}^{\prime}$ do
7     Use $F$, $G$ to predict the label $\widehat{a_{k}^{s}{}^{\prime}}$ and $\widehat{a_{k}^{e}{}^{\prime}}$ for $({\mathcal{P}}_{k}^{\prime},{\mathcal{Q}}_{k}^{\prime})$ and get probability ${p}_{k}^{g}$
8     if ${p}_{k}^{g}\ge {T}_{prob}$ then
9       Put $({\mathcal{P}}_{k}^{\prime},{\mathcal{Q}}_{k}^{\prime},\widehat{a_{k}^{s}{}^{\prime}},\widehat{a_{k}^{e}{}^{\prime}})$ into ${\mathcal{S}}^{P}$
10     end if
11   end for
12   for minibatch $\mathcal{B}$ in ${\mathcal{S}}^{P}$ do
13     Train $F$ and $G$ with minibatch $\mathcal{B}$
14   end for
15   if $j<{N}_{da}$ then
16     $\mathcal{R}=({\{({\mathcal{P}}_{i},{\mathcal{Q}}_{i})\}}_{i=1}^{n})\cup {\mathcal{S}}^{\prime}$
17     for minibatch $\mathcal{B}$ in $\mathcal{R}$ do
18       Train $F$, $G$, $D$ with $\mathcal{B}$ and domain labels
19     end for
20   end if
21 end for
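The control flow of Algorithm 1 can be summarized with a tiny schedule helper, together with a sketch of the domain balancing described for step 16. Both functions are illustrative stand-ins with hypothetical names; they do not train anything and only show which phase runs in which epoch, assuming adversarial learning is skipped exactly in the final adaptation epoch as stated in the text.

```python
import random

def case_schedule(n_pre=3, n_da=4):
    """List the phases of Algorithm 1 in execution order as (phase, epoch) pairs."""
    phases = [("source_training", j) for j in range(1, n_pre + 1)]
    for j in range(1, n_da + 1):
        phases.append(("pseudo_labeling", j))   # steps 5-11
        phases.append(("self_training", j))     # steps 12-14
        if j < n_da:                            # step 15: skipped in the last epoch
            phases.append(("adversarial_learning", j))  # steps 16-19
    return phases

def balance_domains(source, target, rng):
    """Randomly downsample the larger domain so both contribute equally."""
    size = min(len(source), len(target))
    return rng.sample(source, size), rng.sample(target, size)

schedule = case_schedule()
src, tgt = balance_domains(list(range(10)), list(range(4)), random.Random(0))
```

With the epoch numbers used in the experiments (${N}_{pre}=3$, ${N}_{da}=4$), this gives three source-training epochs followed by four rounds of pseudo-labeling and self-training, of which only the first three are followed by adversarial learning.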
Experiment
In this section, we first evaluate the generalization of BERT among 6 recently released RC datasets and analyze the influential factors. Then the performance of the proposed CASe for unsupervised domain adaptation on these datasets is reported, along with an ablation study and the effects of hyperparameters.
Dataset
SQuAD (?) contains 87k training samples and 11k validation (dev) samples. Its questions are written in natural language by crowd workers based on paragraphs from Wikipedia, and its answers are text spans.
CNN and DailyMail (?) contain 374k training and 4k dev samples, and 872k training and 64k dev samples, respectively. Their questions are in cloze form and answers are masked entities in passages.
NewsQA (?) contains 120k samples in total, in which QA pairs were written by crowd workers in natural language, with text-span answers based on stories from CNN.
CoQA (?) contains 109k training samples and 8k dev samples. Questions are given in conversational form with multiple turns, and answers are of various types including text spans and yes/no.
DROP (?) contains 77k training samples and 9.5k dev samples, written by crowd workers based on Wikipedia. It mainly focuses on numerical reasoning and involves answers that are numbers or dates in addition to text spans.
Since CNN and DailyMail are much larger than the other datasets, we uniformly sampled subsets from the two datasets to speed up experiments. The keep ratios are 1/4 and 1/10 respectively, resulting in scales similar to the others.
In addition, we preprocessed samples to construct answer spans for several datasets. The answers in CNN and DailyMail are mask symbols such as "@entity1", which may appear several times in the text. We use a heuristic method to extract spans: 1) find all position indices $\{{a}_{i}\}$ of answer masks in a passage; 2) find all position indices $\{\{{e}_{i}^{1}\},\dots,\{{e}_{i}^{K}\}\}$ of all $K$ question entities in the passage; 3) for each answer occurrence ${a}_{j}$, calculate the sum of absolute index distances between ${a}_{j}$ and the nearest occurrence of every question entity, and use the ${a}_{j}$ with the smallest sum as the answer index. All masks in these two datasets are also replaced with their corresponding original tokens. CoQA contains answers that are not in text span form. We follow the F1-score-based method in the original paper to obtain the best answer spans. The concatenation of all previous QA pairs along with the original question in the current turn is used as the new question. Samples with yes/no answers, or for which no answer span is found, are discarded. Similarly, we only retain answerable questions with text-span answers in NewsQA and DROP.
The characteristics of the 6 processed datasets are shown in Table 1. DROP is significantly smaller than the others because the answers of quantitative reasoning samples are not extractive.
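The three-step heuristic for CNN and DailyMail can be sketched as below. This is an illustration under stated assumptions rather than the preprocessing script itself: the passage is taken to be a list of tokens in which entities appear as single mask tokens, question entities that never occur in the passage are simply ignored, and the function name `pick_answer_index` is hypothetical.

```python
def pick_answer_index(tokens, answer_mask, question_entities):
    """Choose the answer occurrence closest, in total, to the question entities."""
    # Step 1: all positions of the answer mask in the passage.
    answer_positions = [i for i, tok in enumerate(tokens) if tok == answer_mask]
    if not answer_positions:
        return None
    # Step 2: positions of every question entity in the passage.
    entity_positions = [
        [i for i, tok in enumerate(tokens) if tok == entity]
        for entity in question_entities
    ]
    # Assumption: entities absent from the passage contribute nothing.
    entity_positions = [pos for pos in entity_positions if pos]

    # Step 3: sum, over entities, the distance to the nearest occurrence.
    def total_distance(a):
        return sum(min(abs(a - e) for e in pos) for pos in entity_positions)

    return min(answer_positions, key=total_distance)

passage = ["@entity1", "said", "that", "@entity2", "met",
           "@entity3", "and", "@entity1"]
near_entity3 = pick_answer_index(passage, "@entity1", ["@entity3"])
near_entity2 = pick_answer_index(passage, "@entity1", ["@entity2"])
```

In the toy passage the answer mask occurs at positions 0 and 7. With `@entity3` (position 5) as the only question entity the later occurrence is chosen, while with `@entity2` (position 3) the earlier one is closer and is selected instead.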
Dataset  Train  Dev  Corpus  Question 

SQuAD  87,599  10,570  Wikipedia  crowd 
CNN  93,627  3,833  CNN news  cloze 
DailyMail  87,253  6,372  Daily mail  cloze 
NewsQA  76,341  4,327  CNN news  crowd 
CoQA  86,077  6,272  Multiple${}^{*}$  crowd 
DROP  28,267  3,389  Wikipedia  crowd 
Implementation Detail
We implement CASe based on the PyTorch BERT implementation by Hugging Face, using the base-uncased pre-trained model with 12 layers and 768-dimensional hidden states. The maximum input length $m$ is 512, within which the maximum query length is 40. The random sampling dimension ${d}_{R}$ is 768. The input dimension of the first layer in the adversarial network is 768 and its intermediate dimension is 512, with ReLU as the activation function in the first two layers. The generating probability threshold ${T}_{prob}$ is set to 0.4 and ${n}_{best}=20$. The Adam optimizer (?) is employed with learning rate $3\times {10}^{-5}$ in source domain training, $2\times {10}^{-5}$ in self-training and ${10}^{-5}$ in adversarial learning, with batch size 12. Dropout with rate 0.2 is applied to both the BERT feature network and the discriminator. We set the epoch number ${N}_{pre}=3$ in pre-training and ${N}_{da}=4$ in domain adaptation.
Besides, since the input may be longer than $m$, we split a passage using a sliding window with a moving step of 128 so that each piece fits the input length. Text pieces that do not contain the answer are discarded in training.
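The sliding-window truncation can be sketched as follows. The window length of 469 passage tokens used in the example is an assumption (512 minus a 40-token query and three special tokens); the text only fixes the maximum input length, the maximum query length and the stride of 128. Function names are hypothetical.

```python
def sliding_windows(passage_len, window_len, stride=128):
    """Yield (start, end) token offsets, end exclusive, covering the passage."""
    start = 0
    while True:
        end = min(start + window_len, passage_len)
        yield (start, end)
        if end == passage_len:
            break
        start += stride

def training_windows(passage_len, window_len, answer_start, answer_end, stride=128):
    """Keep only the windows that fully contain the answer span (inclusive indices)."""
    return [(s, e) for s, e in sliding_windows(passage_len, window_len, stride)
            if s <= answer_start and answer_end < e]

# A 600-token passage whose answer occupies tokens 500..502.
all_windows = list(sliding_windows(600, 469))
kept_windows = training_windows(600, 469, 500, 502)
```

Here the passage is split into three overlapping pieces, and the first piece is dropped for training because it ends before the answer begins.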
Generalization and Influential Factors
We first test the generalization capability of BERT by fine-tuning it on one dataset and directly applying it to another dataset without any change. We call such models zero-shot models. The performance on the dev sets for transfer among the 6 datasets is shown in Table 2.
At a high level, the performance of zero-shot models drops significantly in most cases, except for transfer between CNN and DailyMail. The average 55.8% reduction in exact match (EM) and 50.0% reduction in F1 compared to models trained on the target dataset (Self) show that BERT cannot generalize well to unseen datasets, even though a huge corpus is used in unsupervised pre-training.
Source \ Target (EM / F1)  SQuAD  CNN  DailyMail  NewsQA  CoQA  DROP 

SQuAD  -  16.72 / 26.42  21.12 / 21.70  40.03 / 57.42  29.58 / 39.58  19.06 / 29.73 
CNN  18.97 / 24.34  -  81.53 / 83.59  9.38 / 15.36  7.10 / 10.26  4.40 / 7.50 
DailyMail  9.72 / 14.76  77.22 / 79.73  -  5.89 / 10.69  5.68 / 8.75  4.69 / 8.02 
NewsQA  64.80 / 78.32  25.10 / 34.66  28.41 / 38.44  -  27.14 / 38.75  12.36 / 21.00 
CoQA  65.25 / 74.92  18.21 / 24.76  22.65 / 28.12  37.74 / 53.85  -  14.75 / 21.60 
DROP  55.53 / 68.36  14.32 / 22.26  17.44 / 25.78  28.36 / 44.35  16.15 / 24.82  - 
Self  79.85 / 87.46  82.76 / 84.73  81.37 / 83.33  52.05 / 67.41  48.98 / 63.99  44.67 / 52.51 
Source \ Target (EM / F1)  SQuAD  CNN  DailyMail  NewsQA  CoQA  DROP 

SQuAD  -  80.64 / 82.24  80.78 / 82.77  52.69 / 68.15  52.38 / 67.56  50.34 / 57.53 
CNN  79.86 / 87.65  -  84.26 / 86.01  48.37 / 63.47  51.71 / 67.09  45.59 / 53.57 
DailyMail  79.04 / 87.07  78.06 / 80.36  -  50.13 / 65.90  50.06 / 65.76  41.69 / 50.07 
NewsQA  80.17 / 88.14  79.60 / 81.57  80.93 / 82.99  -  50.05 / 66.49  47.36 / 56.42 
CoQA  78.38 / 85.93  74.75 / 76.65  76.87 / 78.88  51.21 / 65.83  -  42.08 / 50.07 
DROP  74.03 / 83.35  77.09 / 79.03  80.34 / 82.49  51.91 / 66.95  48.90 / 64.29  - 

SQuAD  -  80.20 / 81.93  79.91 / 82.06  51.56 / 66.79  50.77 / 65.94  48.45 / 57.33 
CNN  78.59 / 86.39  -  83.40 / 85.06  48.95 / 64.45  49.38 / 64.57  44.15 / 51.87 
DailyMail  78.07 / 86.22  82.44 / 84.36  -  50.91 / 65.90  48.64 / 63.80  41.58 / 47.74 
NewsQA  78.87 / 87.06  80.49 / 82.43  80.93 / 82.99  80.99 / 83.07  48.01 / 64.30  45.06 / 54.34 
CoQA  78.24 / 85.80  76.34 / 78.22  78.12 / 79.88  50.80 / 65.55  -  41.43 / 49.40 
DROP  74.81 / 83.67  80.38 / 82.21  80.78 / 82.96  50.01 / 65.16  46.27 / 62.67  - 

Self  79.85 / 87.46  82.76 / 84.73  81.37 / 83.33  52.05 / 67.41  48.98 / 63.99  44.67 / 52.51 
Taking a closer look, we find that the reductions vary across dataset pairs. The drops for transfer among 4 datasets, SQuAD, NewsQA, CoQA and DROP, are smaller than for transfer to or from the other 2 datasets, especially from the latter 3 to SQuAD. Transfer between CNN and DailyMail achieves performance equivalent to Self. CNN and NewsQA share the same corpus, but transfer fails due to different question forms (natural vs. cloze), and the corpus discrepancy between SQuAD and NewsQA leads to a similar result. On the other hand, the same question forms and similar corpora of CNN and DailyMail make transfer successful. Therefore, it can be concluded that not only the corpus but also the question form affects generalization. We also observe that differences in focus and reasoning types affect transfer between datasets even with the same corpus and question type, e.g., simple single-sentence reasoning in SQuAD vs. complex reasoning (comparison, selection) in DROP.
We visualize the relations between the 6 datasets using a force-directed graph in Figure 3. The force between every two datasets is calculated via ${F}_{ij}={P}_{ij}\mathrm{/}{P}_{j}+{P}_{ji}\mathrm{/}{P}_{i}$, where ${P}_{ij}$ is the average of EM and F1 from source dataset $i$ to target dataset $j$, and ${P}_{i}$ is the average of EM and F1 of the Self model on dataset $i$. Edge widths are positively correlated with the force $F$ between nodes, while the size of each node reflects the dataset scale. Note that datasets cluster more strongly according to question forms (node shapes) than according to corpora (node colors), although the latter also have an effect.
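As a worked example of the force formula, the snippet below evaluates ${F}_{ij}$ for two dataset pairs using (EM, F1) values copied from Table 2. The helper names `mean` and `force` are hypothetical and used only for this illustration.

```python
def mean(pair):
    """Average of an (EM, F1) pair."""
    em, f1 = pair
    return (em + f1) / 2.0

def force(p_ij, p_ji, self_i, self_j):
    """F_ij = P_ij / P_j + P_ji / P_i, each P being the mean of EM and F1."""
    return mean(p_ij) / mean(self_j) + mean(p_ji) / mean(self_i)

# i = SQuAD, j = NewsQA: different corpora, both with natural questions.
squad_newsqa = force(p_ij=(40.03, 57.42), p_ji=(64.80, 78.32),
                     self_i=(79.85, 87.46), self_j=(52.05, 67.41))

# i = CNN, j = DailyMail: similar corpora, both with cloze questions.
cnn_dailymail = force(p_ij=(81.53, 83.59), p_ji=(77.22, 79.73),
                      self_i=(82.76, 84.73), self_j=(81.37, 83.33))
```

The force is roughly 1.67 for SQuAD and NewsQA and roughly 1.94 for CNN and DailyMail. A value near 2 means that zero-shot transfer in both directions nearly matches the Self models, so the two nodes are pulled closely together in the graph.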
Domain Adaptation Performance of CASe
We now evaluate the performance of the proposed CASe method for unsupervised domain adaptation on RC datasets, including standard CASe and CASe with an entropy-weighted loss in adversarial learning (CASe+E). The results are shown in Table 3. Regardless of which loss function is used in adversarial learning, CASe achieves a significant performance improvement over zero-shot models. Although annotated data is unavailable in the target domain, most results are comparable to those of the Self models, and some are even better. In short, CASe successfully transfers knowledge from one domain to another.
Domain-adapted models between two very similar datasets, CNN and DailyMail, show higher accuracy than Self. The two datasets are similar in both corpus and question form, which means that more valid data can be utilized for self-training to obtain a model with deeper comprehension. Zero-shot models perform poorly when transferring between natural-question-based datasets and cloze-question-based datasets, e.g., SQuAD to CNN. CASe, however, nearly eliminates the gap between the transferred model and the Self model, owing to the new distribution learned in self-training and the generalized representation optimized in adversarial learning. Most adaptations to CoQA and DROP perform better than Self because they benefit from the additional data.
Entropy-based loss weighting also shows its effectiveness, because it focuses learning on samples that are easy to transfer and thereby obtains more correct knowledge in the target domain. CASe+E is 0.5% to 2% more accurate than CASe under most conditions, except for some specific dataset pairs such as DailyMail to CNN.
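The idea of focusing on easy-to-transfer samples can be sketched as follows. This section does not give the exact weighting formula, so the sketch assumes the form $w = 1 + e^{-H(p)}$ that is commonly used for entropy conditioning in conditional adversarial domain adaptation; both the formula and the function names are assumptions, not a statement of the implementation used here.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weight(probs):
    """Assumed weighting w = 1 + exp(-H): confident predictions get larger weight."""
    return 1.0 + math.exp(-entropy(probs))


# A fully confident prediction has zero entropy and receives the largest weight.
confident = entropy_weight([1.0, 0.0])
# A uniform prediction over two outcomes has entropy ln 2.
uncertain = entropy_weight([0.5, 0.5])
print(confident, uncertain)
```

Under this assumed form, the weight stays in the interval (1, 2], so uncertain samples are down-weighted relative to confident ones but never discarded.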
Ablation Study
We conduct ablation tests on 4 domain adaptation dataset pairs: CNN to SQuAD (C$\mathrm{\to}$S), DailyMail to CNN (D$\mathrm{\to}$C), CNN to NewsQA (C$\mathrm{\to}$N) and SQuAD to CoQA (S$\mathrm{\to}$Co). These cover adaptation between datasets with the same or different question forms and/or corpora. The EM results of the ablated models are shown in Table 4, in which "- conditional" denotes replacing conditional adversarial learning with unconditional adversarial learning, "- Adv learning" denotes removing adversarial learning entirely, "- Self-training" denotes removing self-training, and "- Batch norm" denotes removing batch normalization, all relative to CASe. Self-training plays the most important role under all configurations. Performance also drops when the discriminator is not conditioned on the output or when adversarial learning is removed entirely. Batch normalization has only a slight effect: removing it improves the results under two configurations and degrades them under the other two.
Generalization after domain adaptation
We test the performance of the transferred models on the source datasets to check their generalization; the results are shown in Table 5. The 4 dataset pairs from the ablation study are included, plus NewsQA to DROP (N$\mathrm{\to}$Dr). Performance declines compared to the models trained on the source datasets, except for D$\mathrm{\to}$C, where the two datasets have very similar properties. This means that our CASe method yields a good transferred model but at the same time leads to knowledge loss in the source domain.
Impact of ${T}_{prob}$
Figure 4(a) shows the accuracy and F1 scores of CASe and CASe+E on C$\mathrm{\to}$S for different values of the generating probability threshold ${T}_{prob}$. CASe+E is more stable and performs better than CASe across the values of ${T}_{prob}$. CASe and CASe+E reach their peaks at 0.3 and 0.4 respectively, and both show descending trends when ${T}_{prob}\ge 0.4$.
The numbers of pseudo-labeled samples generated in every epoch on C$\mathrm{\to}$S with different ${T}_{prob}$ are shown in Figure 4(b). As expected, a lower threshold results in more samples and a longer training time. CASe steadily generates more samples in each epoch than in the previous one, whereas the number of samples generated by CASe+E may decrease in the 2nd epoch but exceeds that of CASe in later epochs. Thus CASe+E achieves better results under most conditions because more valid samples are utilized. Considering the overall performance as well as the trade-off between accuracy and complexity, we set ${T}_{prob}$ to 0.4 in our experiments.
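The effect of the threshold on the number of generated samples can be illustrated with a minimal sketch of confidence filtering. It assumes that each target-domain prediction carries a single confidence value and that a prediction is kept when this value is at least ${T}_{prob}$; the function name, the record layout and the example predictions are hypothetical.

```python
def filter_pseudo_labels(predictions, t_prob):
    """Keep predictions whose confidence is at least t_prob (assumed rule)."""
    return [p for p in predictions if p["prob"] >= t_prob]


# Hypothetical model outputs on unlabeled target-domain questions.
predictions = [
    {"id": "q1", "answer": "span a", "prob": 0.92},
    {"id": "q2", "answer": "span b", "prob": 0.41},
    {"id": "q3", "answer": "span c", "prob": 0.18},
]

kept_low = filter_pseudo_labels(predictions, t_prob=0.1)
kept_default = filter_pseudo_labels(predictions, t_prob=0.4)
print(len(kept_low), len(kept_default))
```

A lower threshold keeps more, but less reliable, pseudo-labeled samples, which is the trade-off between training cost and label quality discussed above.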
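Table 4: EM results of the ablated models.

| Model | C$\mathrm{\to}$S | D$\mathrm{\to}$C | C$\mathrm{\to}$N | S$\mathrm{\to}$Co |
| --- | --- | --- | --- | --- |
| CASe+E | 66.46 | 78.06 | 48.37 | 52.38 |
| CASe | 65.24 | 82.44 | 48.95 | 50.77 |
| - conditional | 64.47 | 82.26 | 47.31 | 50.25 |
| - Adv learning | 65.05 | 81.21 | 47.89 | 49.05 |
| - Self-training | 16.55 | 77.07 | 14.26 | 23.81 |
| - Batch norm | 65.97 | 81.91 | 48.27 | 51.08 |

Table 5: Performance of the transferred models on the source datasets.

| Model | C$\mathrm{\to}$S | D$\mathrm{\to}$C | C$\mathrm{\to}$N | S$\mathrm{\to}$Co | N$\mathrm{\to}$Dr |
| --- | --- | --- | --- | --- | --- |
| CASe+E | 66.37 | 82.19 | 64.65 | 52.97 | 40.07 |
| CASe | 68.61 | 81.61 | 65.43 | 51.48 | 40.17 |
| Self | 80.77 | 80.85 | 80.77 | 66.51 | 52.05 |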
Impact of epoch number
In Figure 4(c), we present the performance of CASe and CASe+E after each stage of every epoch on C$\mathrm{\to}$S. For example, 1s denotes the result after the self-training stage in the 1st epoch, and 2a denotes the result after the conditional adversarial learning stage in the 2nd epoch. Compared to CASe, CASe+E shows obvious fluctuations between the self-training and the adversarial learning stages. For both CASe and CASe+E, the performance tends to saturate after 3 complete epochs, which is why we set ${N}_{da}$ to 4.
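The stage labels used above follow from the alternating schedule, which can be sketched as a simple loop. The callables `self_train`, `adversarial` and `evaluate` are hypothetical placeholders for the two training stages and the evaluation step; the dummy implementations below only record the call order and do not train a model.

```python
def run_adaptation(n_da, self_train, adversarial, evaluate):
    """Alternate self-training and adversarial learning for n_da epochs.

    Returns a dict mapping stage labels ("1s", "1a", "2s", ...) to the
    evaluation result obtained right after that stage.
    """
    history = {}
    for epoch in range(1, n_da + 1):
        self_train(epoch)
        history[f"{epoch}s"] = evaluate()
        adversarial(epoch)
        history[f"{epoch}a"] = evaluate()
    return history


# Dummy stages that only log which stage ran, for illustration.
calls = []
history = run_adaptation(
    n_da=4,
    self_train=lambda epoch: calls.append(("self-training", epoch)),
    adversarial=lambda epoch: calls.append(("adversarial", epoch)),
    evaluate=lambda: len(calls),
)
print(list(history))
```

With ${N}_{da}=4$ the loop produces eight evaluation points, two per epoch, in the order in which they appear on the horizontal axis described above.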
Conclusion
In this paper, we explore the possibility of transferring a reading comprehension model from a large-scale labeled dataset to another, unlabeled one. Our experiments show that even the BERT model cannot generalize well between different datasets, and that the divergence of both corpora and question forms causes this failure. We then propose a new unsupervised domain adaptation method, Conditional Adversarial Self-training (CASe). After fine-tuning a BERT model on source data, it alternates self-training and conditional adversarial learning in every epoch to make the model better fit the target domain and to reduce the domain distribution discrepancy. The experimental results on 6 RC datasets demonstrate the effectiveness of CASe. It improves performance remarkably over zero-shot models, reaching accuracy similar to that of supervised models trained on the target domain.
Acknowledgements
We thank Boqing Gong and the anonymous reviewers for insightful comments and feedback.