MLQA: Evaluating Cross-lingual Extractive Question Answering

  • 2019-11-07 05:46:56
  • Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, Holger Schwenk


Question answering (QA) models have shown rapid progress enabled by the availability of large, high-quality benchmark datasets. Such annotated datasets are difficult and costly to collect, and rarely exist in languages other than English, making training QA systems in other languages challenging. An alternative to building large monolingual training datasets is to develop cross-lingual systems which can transfer to a target language without requiring training data in that language. In order to develop such systems, it is crucial to invest in high quality multilingual evaluation benchmarks to measure progress. We present MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area. MLQA contains QA instances in 7 languages, namely English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. It consists of over 12K QA instances in English and 5K in each other language, with each QA instance being parallel between 4 languages on average. MLQA is built using a novel alignment context strategy on Wikipedia articles, and serves as a cross-lingual extension to existing extractive QA datasets. We evaluate current state-of-the-art cross-lingual representations on MLQA, and also provide machine-translation-based baselines. In all cases, transfer results are shown to be significantly behind training-language performance.


Patrick Lewis*    Barlas Oğuz*    Ruty Rinott*    Sebastian Riedel*    Holger Schwenk*
*Facebook AI Research     University College London

Question answering (QA) models have shown rapid progress enabled by the availability of large, high-quality benchmark datasets. Such annotated datasets are difficult and costly to collect, and rarely exist in languages other than English, making training QA systems in other languages challenging. An alternative to building large monolingual training datasets is to develop cross-lingual systems which can transfer to a target language without requiring training data in that language. In order to develop such systems, it is crucial to invest in high quality multilingual evaluation benchmarks to measure progress. We present MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area. (Footnote 1: MLQA will be made publicly available.) MLQA contains QA instances in 7 languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. It consists of over 12K QA instances in English and 5K in each other language, with each instance being parallel between 4 languages on average. MLQA is built using a novel alignment strategy on Wikipedia articles, and serves as a cross-lingual extension to existing extractive QA datasets. We evaluate state-of-the-art cross-lingual models on MLQA, and provide machine-translation-based baselines. In all cases, transfer results are shown to be significantly behind training-language performance.

1 Introduction

Question answering (QA) is a central and highly popular area in NLP. There is an abundance of datasets available to tackle QA from various angles, including reading comprehension, cloze-style completion, and open domain QA (richardson_mctest:_2013; rajpurkar_squad:_2016; chen_reading_2017; kwiatkowski_natural_2019). The field has made tremendous advances in recent years, to the point of exceeding human performance in some settings (devlin_bert:_2018; alberti_synthetic_2019).

Despite such popularity and widespread application, QA datasets in languages other than English remain scarce, even for relatively high-resource languages (asai_multilingual_2018). Collecting such datasets at sufficient scale and quality is difficult and costly. There are two reasons why this unavailability of data prevents the internationalization of QA systems. First, we cannot measure any progress on multilingual QA without relevant benchmark data. Second, we cannot easily train end-to-end QA models on the task, and arguably most recent successes in QA have been observed in fully supervised settings. Given recent progress in zero-shot and cross-lingual learning for tasks such as document classification (Lewis et al., 2004; klementiev_inducing_2012; Schwenk and Li, 2018), semantic role labelling (akbik_generating_2015) and natural language inference (conneau_xnli:_2018), we argue that multilingual QA training data might be useful but is not strictly necessary, whereas multilingual evaluation data is a must-have.

Recognising the need for multilingual datasets both from the perspective of training and evaluation, several cross-lingual datasets have recently been assembled (asai_multilingual_2018; liu_xqa:_2019). However, these generally cover only a small number of languages, combine data from different authors and annotation protocols, lack parallel instances, or explore less practically-useful QA domains or subtasks (see Section 3 for more details). Highly parallel data is particularly attractive, as it enables fairer comparison between languages, requires fewer instances to be annotated in the source language, and allows for additional evaluation setups, such as applying questions from one language to contexts from another, at no extra annotation cost.

A high-quality, purpose-built evaluation benchmark dataset covering a broad range of diverse languages, and following the popular extractive QA paradigm on a practically-useful domain would be the ideal testbed for cross-lingual QA models. With this work, we present such a benchmark, MLQA, and hope that it serves as an accelerator for the field of multilingual question answering in the way datasets such as SQuAD (rajpurkar_squad:_2016) have done for its mono-lingual counterpart. MLQA is a multi-way parallel extractive question answering evaluation benchmark in seven languages: English, Arabic, German, Vietnamese, Spanish, Simplified Chinese and Hindi.

To construct the dataset, we first automatically identify sentences from Wikipedia articles which have the same or similar meaning in multiple languages. We extract such sentences with their surrounding context, forming a set of context paragraphs. We then crowd-source questions on the English paragraphs, making sure the answer is in the aligned sentence. This makes it possible to answer the question in all languages in the vast majority of cases. (Footnote 2: The automatically aligned sentences occasionally differ in a named entity or information content, or some questions may not make sense without the surrounding context. In these rare cases, there may be no answer for some languages.) As a last step, the generated questions are translated to all target languages by professional translators, and the corresponding answer spans are annotated in the aligned context for that target language.

The resulting corpus has between 5,029 and 6,006 QA instances in each language (12,738 in English) where each instance has an aligned equivalent in multiple other languages (always including English), the majority being 4-way aligned. Combined, there are over 46,000 individual QA annotations.

We define two tasks to assess performance on MLQA. The first, cross-lingual transfer (XLT), requires models trained in one language (in our case English) to transfer to test data in a different language. The second, generalised cross-lingual transfer (G-XLT) requires the model to transfer to test data where the question and context languages are different, e.g. questions in Hindi and contexts in Arabic, a setting only possible because MLQA is highly parallel.

We provide baselines using the best available cross-lingual technology. We develop machine translation baselines which map answer spans based on the attention matrices from the translation model, and use multilingual BERT (devlin_bert:_2018) and XLM (Lample and Conneau, 2019) as zero-shot approaches. We use English for our training language and adopt SQuAD as a training dataset. We find that zero-shot XLM transfers best overall, but all models lag well behind training-language performance.

In summary, we make the following contributions: i) we develop a novel annotation pipeline to construct large multilingual, highly-parallel extractive QA datasets; ii) we release MLQA, a 7-language evaluation benchmark for cross-lingual QA; iii) we define two cross-lingual QA tasks, including a novel generalised cross-lingual task (G-XLT); and iv) we provide baselines using state-of-the-art techniques, and demonstrate significant room for improvement.

2 The MLQA corpus

First, we state our desired properties for a cross-lingual QA evaluation dataset. We note that whilst some existing datasets exhibit some of these properties, none exhibit all of them in combination (see Section 3). We then describe our annotation protocol, which seeks to fulfil these desiderata.


Parallel

The dataset should consist of instances that are parallel across many languages. Firstly, this makes comparison of QA performance as a function of transfer language fairer. Secondly, additional evaluation setups become possible, as questions in one language can be applied to documents in another. Finally, annotation cost is also reduced if more instances can be shared between languages. For example, to obtain N instances in each of K target languages, a dataset where each target language is only parallel with the source language would require NK instances in the source language, whereas a dataset that is parallel across all of the languages would require only N.

Natural Documents

Building a parallel QA dataset in many languages requires access to parallel documents in those languages. Manually translating documents into target languages at sufficient scale would require a huge translator workload, and could result in unnatural documents. Finding a way to exploit existing naturally-parallel documents is advantageous, providing high-quality documents without requiring manual translation.

Diverse Languages

One of the primary goals of cross-lingual understanding research is to develop systems that work well in many languages. The dataset should enable quantitative performance comparison across several languages, including languages with different quantities of linguistic resources, different language families and scripts.

Extractive QA

Popular cross-lingual understanding benchmarks are typically based on classification (conneau_xnli:_2018). Extracting spans in different languages represents a different language understanding challenge. Whilst there are extractive QA datasets in a number of languages (see Section 3), most were created at different times by different teams with different annotation interfaces and procedures, making quantitative comparisons challenging.

Textual Domain

We require a naturally highly language-parallel textual domain, as mentioned above. Also, it is desirable to select a textual domain that matches existing extractive QA training resources, in order to isolate the change in performance due to language transfer.



Figure 1: MLQA annotation pipeline. Only one target language is shown for clarity. Left: We first identify N-way parallel sentences $b_{en}, b_1 \ldots b_{N-1}$ in Wikipedia articles on the same topic, and extract the paragraphs that contain them, $c_{en}, c_1 \ldots c_{N-1}$. Middle: Workers formulate questions $q_{en}$ from $c_{en}$ for which answer $a_{en}$ is a span within $b_{en}$. Right: English questions $q_{en}$ are then translated by professional translators into all languages $q_i$ and the answer $a_i$ is annotated in the target language context $c_i$ such that $a_i$ is a span within $b_i$.

To build a dataset that best satisfies the desiderata above, we devised the procedure described below and illustrated in Figure 1.

Wikipedia represents a convenient textual domain, with its size and multi-linguality enabling collection of data in many diverse languages at scale. It has been used to build many available QA training resources, allowing us to leverage these to train QA models, instead of having to build our own training dataset. Our annotation pipeline consists of three main steps:

Step 1) We automatically mine sentences that are parallel from Wikipedia articles on the same topic in each of our languages (left of Figure 1). Parallel Sentence mining allows us to leverage naturally-written documents and avoid document translation, which would be expensive and result in potentially unnatural documents.

Step 2) We choose one language to be the source language (English) and employ crowd-workers to annotate questions and answer spans on English paragraphs which contain a parallel sentence (centre of Figure 1). Annotators must choose answer spans within the parallel source sentence. This allows us to annotate questions in the source language which have a high probability of being answerable in the target languages, even if the rest of the context paragraphs are very different.

Step 3) Next, we employ professional translators to translate the questions to the target languages. Finally, we employ bilingual annotators to annotate answer spans in the aligned sentences in the target language (right of Figure 1).

The following sections describe each step in the data collection pipeline in more detail.

2.1 Sentence alignment

An important aspect of our approach is to use the original contexts in the target languages as they appear in Wikipedia. To make sure that questions can be answered in every target language, we use contexts containing an N-way parallel sentence.

Our approach is similar to the WikiMatrix project (Schwenk et al., 2019), which extracts parallel sentences for many language pairs in Wikipedia. However, there are two important differences: we limit the search for parallel sentences to documents of the same topic only, and we are aiming at N-way parallel sentences. To detect parallel sentences we use the freely available LASER toolkit, which achieves state-of-the-art performance in mining parallel sentences (artetxe_margin-based_2018). LASER uses multilingual sentence embeddings and a distance or margin criterion in the joint embeddings space to decide whether two sentences are parallel. The reader is referred to (artetxe_massively_2018) and (artetxe_margin-based_2018) for a detailed description. Table 9 in Appendix A.2 gives statistics on the number of parallel sentences mined for all language pairs.
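To make the idea of a margin criterion concrete, the following is a minimal Python sketch. The text above does not state which criterion is used, so the ratio-margin form shown here is an assumption, as are the function names, the toy two-dimensional vectors and the neighbourhood size `k`. It is not the LASER implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def margin_score(x, y, src_vecs, tgt_vecs, k=2):
    """Assumed ratio margin: cos(x, y) divided by the average similarity of
    x and y to their k nearest neighbours in the other language."""
    nn_x = sorted((cosine(x, t) for t in tgt_vecs), reverse=True)[:k]
    nn_y = sorted((cosine(y, s) for s in src_vecs), reverse=True)[:k]
    denom = sum(nn_x) / (2 * len(nn_x)) + sum(nn_y) / (2 * len(nn_y))
    return cosine(x, y) / denom

# Toy embeddings (hypothetical): source sentence 0 matches target sentence 0.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[1.0, 0.1], [0.1, 1.0]]
matched = margin_score(src[0], tgt[0], src, tgt)
mismatched = margin_score(src[0], tgt[1], src, tgt)
```

A candidate pair would then be accepted as parallel when its score exceeds a chosen threshold. Normalising by the neighbourhood similarity penalises sentences that are close to everything in the embedding space.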

Target language choice

We choose English as our source language because it has the largest Wikipedia and because crowd workers are easy to find for it. We choose six other languages with a sufficiently large Wikipedia and which represent a broad range of linguistic phenomena.

de      es      ar      zh      vi      hi
5.4M    1.1M    83.7k   24.1k   9.2k    1,340

Table 1: Incremental alignment with English to obtain 7-way aligned sentences. Each column gives the number of parallel sentences remaining after that language is added.

We first independently align all languages with English, then intersect these sets of parallel sentences, forming sets of N-way parallel sentences. As shown in Table 1, starting with 5.4M parallel English/German sentences, the number of N-way parallel sentences quickly decreases as more languages are added. In addition, we found that 7-way parallel sentences generally lack linguistic diversity, and often appear in the first sentence or paragraph of an article.

As a compromise between language-parallelism and both the number and diversity of parallel sentences, we use sentences that are 4-way parallel. This yielded 385,396 mined parallel sentences (see Table 9 in Appendix A.2) which were sub-sampled to ensure parallel sentences were evenly distributed in paragraphs, and target languages equally represented.

We ensure that each language combination is equally represented, so that each target language has many QA instances in common with all the other target languages. Except for any losses due to rejected instances later in the pipeline, each QA instance we eventually create will be parallel between English and three target languages.
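The intersection step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the `alignments` mapping (language code to a dictionary from English sentence ID to target sentence ID) and every ID below are hypothetical.

```python
from itertools import combinations

def n_way_parallel(alignments, n_targets):
    """Group English sentences by the target-language combinations they align to.

    alignments: dict mapping language code -> {english_id: target_id}
    n_targets:  number of target languages per group (3 gives 4-way
                parallel sentences once English is counted).
    Returns a dict mapping a sorted tuple of languages to a list of
    (english_id, {lang: target_id}) entries.
    """
    groups = {}
    for combo in combinations(sorted(alignments), n_targets):
        # English sentences that have an aligned sentence in every language.
        shared = set.intersection(*(set(alignments[lang]) for lang in combo))
        if shared:
            groups[combo] = [
                (en_id, {lang: alignments[lang][en_id] for lang in combo})
                for en_id in sorted(shared)
            ]
    return groups

# Toy, made-up pairwise alignments with English for three target languages.
toy = {
    "de": {"en1": "de7", "en2": "de9", "en3": "de4"},
    "es": {"en1": "es3", "en3": "es8"},
    "ar": {"en1": "ar5", "en2": "ar2"},
}
three_way = n_way_parallel(toy, n_targets=2)
```

An English sentence aligned in more languages than `n_targets` appears under several combinations, so a sub-sampling pass is still needed to balance the language combinations.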

2.2 English QA annotation



Figure 2: English QA annotation interface screenshot

We use the Amazon Mechanical Turk crowd-sourcing platform to annotate English QA instances, broadly following the methodology of rajpurkar_squad:_2016. We present the worker with an English aligned sentence, $b_{en}$, along with the paragraph that contains it, context $c_{en}$. We ask the workers to formulate a question $q_{en}$ and to highlight the shortest answer span $a_{en}$ that answers it. $a_{en}$ must be a subspan of $b_{en}$ to ensure $q_{en}$ will be answerable in the target languages. We include a “No Question Possible” button for cases where no sensible question can be asked. A screenshot of the annotation interface is shown in Figure 2. The first 15 questions produced by each worker are screened for quality, after which annotations are either auto-approved or the worker is contacted for further instruction.

Once the questions and answers have been annotated, we start another task to re-annotate English answers. Here, workers are presented with $q_{en}$ and $c_{en}$, and requested to highlight the shortest span $a_{en}$ that best answers $q_{en}$ or to indicate that $q_{en}$ is not answerable. Two additional answer span annotations are collected for each question.

The additional answer annotations enable us to calculate an inter-annotator agreement (IAA) score. Here, we calculate the mean token F1 score between the three answer annotations, giving an IAA score of 82%, comparable to the SQuAD v1.1 development set, where this IAA measure is 84%.

Rather than provide all three answer annotations as reference answers, we instead select a single representative reference answer. In 88% of cases, two or three of the answers exactly match, so the majority answer is selected. In the remaining cases, the answer with highest F1 word overlap with the other two is chosen. This results in an accurate answer, and ensures the English results are comparable to those in the target languages, where only one answer is annotated per question.
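The agreement score and the choice of a representative answer can be sketched as below. This is a minimal illustration, assuming whitespace tokenisation, no further answer normalisation, and that agreement is the mean over the three annotation pairs; the function names are our own.

```python
from collections import Counter
from itertools import combinations

def token_f1(pred, ref):
    """Token-level F1 between two whitespace-tokenised answer strings."""
    p, r = pred.split(), ref.split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def agreement(answers):
    """Mean pairwise token F1 over one instance's answer annotations."""
    pairs = list(combinations(answers, 2))
    return sum(token_f1(a, b) for a, b in pairs) / len(pairs)

def representative(answers):
    """Majority answer if at least two annotations match exactly, otherwise
    the annotation with the highest total F1 overlap with the others."""
    top, count = Counter(answers).most_common(1)[0]
    if count >= 2:
        return top

    def overlap_with_others(i):
        return sum(token_f1(answers[i], b)
                   for j, b in enumerate(answers) if j != i)

    best = max(range(len(answers)), key=overlap_with_others)
    return answers[best]
```

For the annotations "the river Thames", "river Thames" and "Thames", no two match exactly, so the middle one is chosen because it overlaps most with the other two.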

Instances where any annotator marked the question as unanswerable are discarded. Additionally, any instances where over 50% of the question appeared as a sub-sequence of the aligned sentence are discarded, as these are excessively easy or of poor quality. Finally, we reject questions where the answer IAA score was very low (< 0.3), removing a small number of low-quality instances. This results in 12,951 English QA instances, which were then sent for target language annotation.

2.3 Target language QA annotation

We use the One Hour Translation platform to source professional translators to translate the questions from English to the six target languages, and to find the answer in the target context. We present each translator with the English question $q_{en}$, English answer $a_{en}$, and the context $c_x$ (containing aligned sentence $b_x$) in target language $x$.

The translators are only shown the aligned sentence and the sentence on each side (where these exist). This increases the chance of the question being answerable, as in some cases the aligned sentences are not perfect mutual translations, without requiring workers to read the entire context $c_x$. OpenCC is used to convert all Chinese contexts to Simplified Chinese. We ask the translators to select the minimal and closest span that answers the question based on the English answer. By providing the English answer we try to minimize cultural and personal differences in the amount of detail included in the answer.

2.4 “No Answer” Annotations

We discard instances in target languages where the answer annotator indicated there was no answer for that language. This means that some instances are no longer 4-way parallel. “No Answer” annotations occurred for between 6.6% and 21.9% of instances (Vietnamese and German, respectively). We release these “No Answer” instances separately as an additional resource, but do not consider them in our experiments or analysis.

2.5 The resulting MLQA corpus

Contexts, questions and answer spans for all the languages are then brought together to create the final MLQA corpus. MLQA consists of 12,738 extractive QA instances in English and between 5,029 and 6,006 instances in the target languages. 9,019 instances are 4-way parallel, 2,930 are 3-way parallel and 789 are 2-way parallel. MLQA is split into development and test splits, with detailed statistics in Tables 2, 3 and 4, and representative examples in Figure 3. Vietnamese has the longest whitespace-tokenized contexts on average, whilst German has the shortest, but all languages have a substantial tail of long contexts. Other than Chinese, answers are on average between 3 and 5 tokens.

fold    en      de      es      ar      zh      vi      hi
dev     1148    512     500     517     504     511     507
test    11590   4517    5253    5335    5137    5495    4918

Table 2: Number of instances per language in MLQA.

        de      es      ar      zh      vi      hi
de      5029
es      1972    5753
ar      1856    2139    5852
zh      1811    2108    2100    5641
vi      1857    2207    2210    2127    6006
hi      1593    1910    2017    2124    2124    5425

Table 3: Number of parallel instances between target language pairs (all instances are parallel with English).

            en      de      es      ar      zh*     vi      hi
Context     157.5   102.2   103.4   116.8   222.9   195.1   141.5
Question    8.4     7.7     8.6     7.6     14.3    10.6    9.3
Answer      3.1     3.2     4.1     3.4     8.2     4.5     3.6

Table 4: Mean sequence lengths (tokens) in MLQA. *calculated with mixed segmentation (Section 4.1)

Figure 3: (a) MLQA example parallel for En-De-Ar-Vi. (b) MLQA example parallel for En-Es-Zh-Hi. Answers shown as highlighted spans in contexts. Contexts shortened for clarity with “[…]”.

3 Related Work

There is a great variety of QA datasets in English, popularized with the introduction of MCTest (richardson_mctest:_2013), CNN/Daily Mail (hermann_teaching_2015), CBT (hill_goldilocks_2016), and WikiQA (yang_wikiqa:_2015) amongst others. Large span-based machine comprehension datasets such as SQuAD (rajpurkar_squad:_2016), TriviaQA (joshi_triviaqa:_2017), NewsQA (trischler_newsqa:_2016), SQuAD 2.0 (rajpurkar_know_2018) and Natural Questions (kwiatkowski_natural_2019) have seen extractive, span-based question answering become a dominant paradigm in QA.

However, large, high quality QA datasets in other languages are comparatively rare. There are several resources in Chinese, such as DUReader (he_dureader:_2018), CMRC (cui_span-extraction_2019) and DRCD (shao_drcd:_2018), and more recently there have been efforts to generate datasets in a wider array of languages, such as Korean (Lim et al., 2019) and Arabic (mozannar_neural_2019).

Cross-lingual QA as a discipline has been explored in QA for RDF data for a number of years, with notable examples being the QALD-3 (cimiano_multilingual_2013) and QALD-5 tracks (unger_question_2015), and with more recent work from zimina_mug-qa:_2018.

The last year or so has seen interest in cross-lingual QA increase. lee_semi-supervised_2018 explore an approach to use English QA data from SQuAD to improve QA performance in Korean using an in-language seed dataset. kumar_cross-lingual_2019 study the related task of question generation by leveraging English questions to generate better questions in Hindi, and lee_cross-lingual_2019 and cui_cross-lingual_2019 develop modelling approaches to improve performance on Chinese QA tasks using English resources. lee_learning_2019 and hsu_zero-shot_2019 explore modelling approaches for zero-shot language transfer for QA and singh_xlda:_2019 explore how training with cross-lingual data can regularize QA and NLI models.

The increasing interest in cross-lingual QA has also seen a number of datasets recently released. gupta_mmqa:_2018 release a parallel QA dataset in English and Hindi, hardalov_beyond_2019 investigate zero-shot transfer for QA from English to Bulgarian, liu_xcmrc:_2019 released a cloze QA dataset in Chinese and English, and jing_bipar:_2019 have recently released “BiPar”, consisting of parallel paragraphs from novels in English and Chinese. These datasets have a similar spirit as MLQA, but are limited to two languages.

asai_multilingual_2018 investigate extractive QA on a small, purpose-built, manually translated subset of 327 SQuAD instances in Japanese and French, and develop a phrase-alignment “Translate-Test” modelling technique, showing improvements over a naïve back-translation baseline. Like MLQA, they focus on extractive QA and build multi-way parallel data, but MLQA has many more instances, covers more languages and does not require manual document translation of English contexts. liu_xqa:_2019 explore cross-lingual open-domain QA with a cloze-style QA dataset built from Wikipedia “Did you know?” factoid questions, covering nine languages. However, unlike MLQA, it is distantly supervised, the dataset size varies greatly by language, the instances are not parallel across languages, and answer distributions vary by language, making quantitative comparisons challenging.

Finally, in contemporaneous work to MLQA, artetxe_cross-lingual_2019 release XQuAD, a dataset of 1190 SQuAD instances from 240 paragraphs by fully-manually translating the instances into ten languages. MLQA covers fewer languages than XQuAD, but contains 5x more QA pairs and 20x more context paragraphs per language, and uses original contexts as they appear in Wikipedia rather than manual translation from English.

4 Cross-lingual Question Answering Experiments

We introduce two tasks to assess cross-lingual QA performance with MLQA.

The first task, cross-lingual transfer (XLT), requires training a model with $(c_x, q_x, a_x)$ training data in language $x$, in our case English. Development data in language $x$ is used for early stopping and tuning. At test time, the model must extract answer $a_y$ in language $y$ given context $c_y$ and question $q_y$.

The second task, generalized cross-lingual transfer (G-XLT), is trained in the same way as the first task, but at test time the model is required to extract answer $a_z$ from context $c_z$ in language $z$ given question $q_y$ in language $y$. This evaluation setup is only possible because MLQA is question-parallel, allowing us to swap $q_z$ for $q_y$ for parallel instances in languages $z$ and $y$ without changing the meaning of the question.
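The difference between the two test setups can be shown with a short Python sketch that builds the evaluation examples for one parallel instance. The data layout (a dictionary from language code to question, context and answer) is a hypothetical one chosen for illustration, not the MLQA release format.

```python
from itertools import product

def build_eval_sets(instance, languages):
    """Pair questions and contexts for one parallel QA instance.

    instance:  dict lang -> {"question": str, "context": str, "answer": str}
    languages: languages to consider; those missing from the instance
               (e.g. after a "No Answer" annotation) are skipped.
    Returns (xlt, gxlt), lists of
    (question_lang, context_lang, question, context, answer) tuples.
    The reference answer always comes from the context language.
    """
    available = [lang for lang in languages if lang in instance]
    xlt, gxlt = [], []
    for q_lang, c_lang in product(available, repeat=2):
        example = (
            q_lang,
            c_lang,
            instance[q_lang]["question"],
            instance[c_lang]["context"],
            instance[c_lang]["answer"],
        )
        gxlt.append(example)      # every question/context language pairing
        if q_lang == c_lang:
            xlt.append(example)   # question and context share a language
    return xlt, gxlt

# Placeholder strings stand in for real questions, contexts and answers.
toy_instance = {
    lang: {"question": f"q_{lang}", "context": f"c_{lang}", "answer": f"a_{lang}"}
    for lang in ("en", "de", "hi")
}
xlt_examples, gxlt_examples = build_eval_sets(toy_instance, ["en", "de", "hi", "zh"])
```

With three available languages this yields three XLT examples and nine G-XLT examples, of which six mix question and context languages.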

Since MLQA only consists of development and test splits, we adopt SQuAD v1.1 as training data in our baseline models. We use MLQA-en as our development split, and focus on zero-shot evaluation, where no training or development data is available in target languages. We establish a number of baselines to assess current capabilities for cross-lingual QA for our tasks:


Translate-Train

We translate paragraphs and questions from the SQuAD training set into the target language using a machine-translation system. (Footnote 5: We use Facebook’s production translation models.) Prior to translating, we enclose the answer span in quotes, as was done by lee_semi-supervised_2018. This makes it simple to extract the answer from the translated paragraph, as well as encouraging the translation model to map the answer into a single span. We discard training instances where this fails (5% of questions). The resulting corpus is then used to train a model in the target language.
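The quoting trick can be sketched as follows. The translation call itself is outside the sketch, so the German string below is a hand-written stand-in for a system output. The helper names, the use of the ASCII double quote as the marker, and the rule of discarding any output that does not contain exactly two markers are all assumptions made for illustration.

```python
QUOTE = '"'

def mark_answer(context, start, answer):
    """Wrap the answer span, located at character offset `start`, in quotes."""
    end = start + len(answer)
    assert context[start:end] == answer, "answer offset does not match context"
    return context[:start] + QUOTE + answer + QUOTE + context[end:]

def recover_answer(translated):
    """Extract the quoted span from a translated paragraph.

    Returns (clean_context, answer_start, answer), or None when the output
    does not contain exactly one quoted span, in which case the training
    instance would be discarded.
    """
    if translated.count(QUOTE) != 2:
        return None
    first = translated.find(QUOTE)
    second = translated.find(QUOTE, first + 1)
    answer = translated[first + 1:second]
    clean = translated[:first] + answer + translated[second + 1:]
    return clean, first, answer

marked = mark_answer("Paris is the capital of France.", 0, "Paris")
# Stand-in for the output of a machine-translation system on `marked`.
fake_translation = '"Paris" ist die Hauptstadt von Frankreich.'
recovered = recover_answer(fake_translation)
```

A context that already contains quotation marks would need a different marker or an escaping step, which this sketch does not handle.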


Translate-Test

The context and question in the target language are translated into English at test time. We use our best English system to produce an answer span in the translated paragraph. For all languages other than Hindi, we use attention scores, $a_{ij}$, from the translation model to map the answer back to the original language. (Footnote 6: Alignments were not available for Hindi-English due to production model limitations. Instead we translate the English answers using another round of machine translation. As back-translated answers might not map back to a span in the original context, Translate-test performs poorly in this case.) Rather than aligning spans by attention argmaxing, as done by asai_multilingual_2018, we identify the span in the original context which maximizes F1 score with the English span, as follows:

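$$
\begin{aligned}
\mathrm{RC} &= \frac{\sum_{i \in S_e,\, j \in S_o} a_{ij}}{\sum_{i \in S_e} a_{i*}} \\
\mathrm{PR} &= \frac{\sum_{i \in S_e,\, j \in S_o} a_{ij}}{\sum_{j \in S_o} a_{*j}} \\
F_1 &= \frac{2 \cdot \mathrm{RC} \cdot \mathrm{PR}}{\mathrm{RC} + \mathrm{PR}} \\
\mathrm{answer} &= \arg\max_{S_o} F_1(S_o)
\end{aligned}
\tag{1}
$$

where $S_e$ and $S_o$ are the English and original spans respectively, $a_{i*} = \sum_j a_{ij}$ and $a_{*j} = \sum_i a_{ij}$.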
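Equation 1 can be read as a search over contiguous spans in the original context. The sketch below is one straightforward way to carry out that search, assuming the attention matrix is given as a list of rows indexed by English token and that spans are inclusive token ranges. The exhaustive loop and the optional `max_len` cap are our own choices, not details from the text.

```python
def map_span(attention, english_span, max_len=None):
    """Find the original-language span that maximizes F1 with an English span.

    attention[i][j]: attention weight a_ij between English token i and
                     original-language token j.
    english_span:    (start, end) inclusive token indices of S_e.
    Returns ((start, end), f1) for the best original-language span S_o.
    """
    n_en, n_orig = len(attention), len(attention[0])
    row_sums = [sum(row) for row in attention]                  # a_{i*}
    col_sums = [sum(attention[i][j] for i in range(n_en))       # a_{*j}
                for j in range(n_orig)]
    rows = range(english_span[0], english_span[1] + 1)
    rc_denom = sum(row_sums[i] for i in rows)

    best, best_f1 = None, -1.0
    for start in range(n_orig):
        mass = 0.0       # sum of a_ij over i in S_e and j in S_o
        pr_denom = 0.0   # sum of a_{*j} over j in S_o
        last = n_orig if max_len is None else min(n_orig, start + max_len)
        for end in range(start, last):
            mass += sum(attention[i][end] for i in rows)
            pr_denom += col_sums[end]
            rc = mass / rc_denom if rc_denom else 0.0
            pr = mass / pr_denom if pr_denom else 0.0
            f1 = 2 * rc * pr / (rc + pr) if rc + pr else 0.0
            if f1 > best_f1:
                best, best_f1 = (start, end), f1
    return best, best_f1

# Toy attention matrix: 3 English tokens (rows) by 4 original tokens (columns).
toy_attention = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.0, 0.0, 0.1, 0.9],
]
span, score = map_span(toy_attention, english_span=(1, 1))
```

Here the English answer is token 1, whose attention mass falls mostly on original token 2, so the single-token span (2, 2) scores highest. Widening the span adds recall but loses more precision.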

Cross-lingual Representation Models

We present zero-shot transfer results from multilingual BERT (cased, 104 languages) (devlin_bert:_2018) and XLM (MLM + TLM, 15 languages) (Lample and Conneau, 2019). A model is trained using the SQuAD training set and evaluated directly on the MLQA test set in the target language. Model selection is also constrained to be strictly zero-shot, using only the English development set to pick hyper-parameters. As a result, we end up with a single model that we test for all 7 languages.

All models were trained using the SQuAD v1.1 fine-tuning method from devlin_bert:_2018, performed using the Pytext NLP toolkit (Aly et al., 2018).

4.1 Evaluation Metrics for Multilingual QA

Most extractive QA tasks use Exact Match (EM) and mean token F1 score as performance metrics. The widely-used SQuAD evaluation script also performs the following answer-preprocessing operations: i) lowercasing, ii) stripping (ASCII) punctuation, iii) stripping (English) articles and iv) whitespace tokenisation. We introduce the following modifications for fairer evaluation across languages. Instead of stripping ASCII punctuation, we strip all Unicode characters with a punctuation General_Category. When a language has stand-alone articles (English, Spanish, German and Vietnamese) we strip them. We use whitespace tokenization for all of the MLQA languages other than Chinese, where we use the mixed segmentation method from CMRC2018, which splits by character for Chinese characters, but by whitespace for any code-switched English tokens (cui_span-extraction_2019). We make this multilingual evaluation script available with the MLQA data.
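The preprocessing described above can be approximated in a few lines of Python. This is a simplified sketch and not the released evaluation script: only the English article list is filled in, the Chinese character test covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the segmentation is a plain re-implementation of the character-or-whitespace idea. All three are assumptions.

```python
import unicodedata

# Only the English list is given; lists for other languages would be supplied
# by the caller (hypothetical structure, not taken from the released script).
ARTICLES = {"en": {"a", "an", "the"}}

def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def mixed_segmentation(text):
    """Split Chinese characters individually and other text on whitespace."""
    tokens, buffer = [], ""
    for ch in text:
        if is_cjk(ch) or ch.isspace():
            if buffer:               # flush any pending non-Chinese token
                tokens.append(buffer)
                buffer = ""
            if is_cjk(ch):
                tokens.append(ch)
        else:
            buffer += ch
    if buffer:
        tokens.append(buffer)
    return tokens

def normalize_answer(text, lang, articles=ARTICLES):
    """Lowercase, strip Unicode punctuation, tokenise and drop articles."""
    text = text.lower()
    # Remove every character whose Unicode General_Category starts with "P".
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    tokens = mixed_segmentation(text) if lang == "zh" else text.split()
    stop = articles.get(lang, set())
    return [tok for tok in tokens if tok not in stop]
```

For example, `normalize_answer("¿Dónde?", "es")` drops both the inverted and the regular question mark, which ASCII-only punctuation stripping would not do.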

5 Results

5.1 XLT Results

Table 5 shows the results on the XLT task. Overall, XLM performs best, transferring best in Spanish, German and Arabic, and competitively with translate-train+M-BERT for Vietnamese and Chinese. XLM is, however, weaker in English. Even for XLM, there is a 39.8% mean drop in Exact Match score (20.9% in F1) over the training-language BERT-large baseline, showing there is significant room for improvement. Arabic and Hindi present the biggest challenge for our models, whereas Spanish is the easiest to transfer to.

F1 / EM                   en           es           de           ar           hi            vi           zh
BERT-Large                80.2 / 67.4  -            -            -            -             -            -
Multilingual-BERT         77.7 / 65.2  64.3 / 46.6  57.9 / 44.3  45.7 / 29.8  43.8 / 29.7   57.1 / 38.6  57.5 / 37.3
XLM                       74.9 / 62.4  68.0 / 49.8  62.2 / 47.6  54.8 / 36.3  48.8 / 27.3   61.4 / 41.8  61.1 / 39.6
Translate test, BERT-L    -            65.4 / 44.0  57.9 / 41.8  33.6 / 20.4  23.8 / 18.9*  58.2 / 33.2  44.2 / 20.3
Translate train, M-BERT   -            53.9 / 37.4  62.0 / 47.5  51.8 / 33.2  55.0 / 40.0   62.0 / 43.1  61.4 / 39.5
Translate train, XLM      -            65.2 / 47.8  61.4 / 46.7  54.0 / 34.4  50.7 / 33.4   59.3 / 39.4  59.8 / 37.9

Table 5: F1 score and Exact Match on the MLQA test set for the cross-lingual transfer task (XLT).

A manual analysis of errors from the XLM model where the model failed to exactly match the reference answer was carried out for all target languages. 38% of these errors come from completely wrong answers, 5% from annotation errors and 7.5% from acceptable answers with no overlap with the golden answer. The remaining 49% of errors come from answers that partially overlap with the golden span. These span differences are often large, suggesting the model has difficulty accurately modelling answer spans in the target languages.

To explore whether questions that were difficult for the model in English were also challenging in the target languages, we split MLQA into two subsets where the XLM model either got an F1 score of zero in English (15% of the data) or greater than zero (the remaining 85%). Figure 4 shows that whilst the model’s transfer performance is better when the English answer is correct, the performance is non-trivial when the English answer is wrong, suggesting some questions are easier to answer in some languages than others.



Figure 4: F1 scores stratified by English performance
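The stratification behind Figure 4 can be sketched as follows. This is a minimal illustration assuming per-question F1 scores are available as (English F1, target-language F1) pairs; the scores below are invented placeholders, not MLQA results, and the function name is hypothetical.

```python
from statistics import mean

# Hypothetical per-question F1 scores: (English F1, target-language F1).
scores = [
    (0.0, 0.4),
    (0.0, 0.0),
    (1.0, 0.8),
    (0.5, 0.6),
    (1.0, 1.0),
]


def stratify_by_english_f1(pairs):
    """Average target-language F1 separately for English F1 == 0 and English F1 > 0."""
    english_zero = [target for english, target in pairs if english == 0.0]
    english_nonzero = [target for english, target in pairs if english > 0.0]
    return {
        "english_f1_zero": mean(english_zero) if english_zero else None,
        "english_f1_nonzero": mean(english_nonzero) if english_nonzero else None,
    }


result = stratify_by_english_f1(scores)
```

A non-zero mean in the `english_f1_zero` bucket corresponds to the observation that some questions are answered in the target language even when the English answer is wrong.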

5.2 G-XLT Results

Results for M-BERT on the G-XLT task are shown in Table 6. When given a question in a target language, the best performance is always attained by using an English context. When given a context in a target language, English questions perform best for Spanish, German, and Arabic, but Hindi, Vietnamese and Chinese perform best when the questions match the context language.

q/c en es de ar hi vi zh
en 77.7 64.4 62.7 45.7 40.1 52.2 54.2
es 67.4 64.3 58.5 44.1 38.1 48.2 51.1
de 62.8 57.4 57.9 38.8 35.5 44.7 46.3
ar 51.2 45.3 46.4 45.6 32.1 37.3 40.0
hi 51.8 43.2 46.2 36.9 43.8 38.4 40.5
vi 61.4 52.1 51.4 34.4 35.1 57.1 47.1
zh 58.0 49.1 49.6 40.5 36.0 44.6 57.5
Table 6: MLQA Test F1 for M-BERT for generalized cross-lingual transfer (G-XLT). Rows indicate question language, columns indicate context language.
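The two observations about Table 6 can be checked directly from the reported numbers. The sketch below copies the F1 values from Table 6 and finds the best context language for each question language, and the best question language for each context language; the variable and function names are illustrative.

```python
LANGS = ["en", "es", "de", "ar", "hi", "vi", "zh"]

# F1 values copied from Table 6: keys are question languages,
# list positions follow LANGS and index the context language.
F1 = {
    "en": [77.7, 64.4, 62.7, 45.7, 40.1, 52.2, 54.2],
    "es": [67.4, 64.3, 58.5, 44.1, 38.1, 48.2, 51.1],
    "de": [62.8, 57.4, 57.9, 38.8, 35.5, 44.7, 46.3],
    "ar": [51.2, 45.3, 46.4, 45.6, 32.1, 37.3, 40.0],
    "hi": [51.8, 43.2, 46.2, 36.9, 43.8, 38.4, 40.5],
    "vi": [61.4, 52.1, 51.4, 34.4, 35.1, 57.1, 47.1],
    "zh": [58.0, 49.1, 49.6, 40.5, 36.0, 44.6, 57.5],
}


def best_context_for_question(q_lang: str) -> str:
    # Row-wise argmax: which context language gives the highest F1 for this question language.
    row = F1[q_lang]
    return LANGS[row.index(max(row))]


def best_question_for_context(c_lang: str) -> str:
    # Column-wise argmax: which question language gives the highest F1 for this context language.
    col = LANGS.index(c_lang)
    return max(LANGS, key=lambda q: F1[q][col])


best_context = {q: best_context_for_question(q) for q in LANGS}
best_question = {c: best_question_for_context(c) for c in LANGS}
```

Note that several of the margins are small (for example 45.7 vs 45.6 for Arabic contexts), so the argmax alone should not be over-interpreted.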

5.3 Comparisons between SQuAD v1.1 and MLQA-English

The MLQA-en results in Table 5 are lower than reported results on SQuAD v1.1 in the literature for equivalent models. However, once SQuAD scores are adjusted to reflect only a single answer annotation (picked using the same method used to pick MLQA answers), the discrepancy drops to 5.8% on average (see Table 7). MLQA-en contexts are also on average 28% longer than SQuAD's, and MLQA covers a much wider set of articles than SQuAD, which was built from a small subset of curated articles. Minor differences in preprocessing and answer lengths may also contribute to performance differences (MLQA-en answers are slightly longer, 3.1 tokens vs 2.9 on average). Question type distributions are very similar in both datasets (Figure 5 in Appendix A).

Model       Dataset  F1     EM
BERT-Large  SQuAD    91.03  80.89
BERT-Large  SQuAD*   84.80  72.85
BERT-Large  MLQA-en  80.22  67.37
M-BERT      SQuAD    88.46  81.22
M-BERT      SQuAD*   82.98  71.17
M-BERT      MLQA-en  77.74  65.18
XLM         SQuAD    87.62  80.52
XLM         SQuAD*   82.14  69.73
XLM         MLQA-en  74.89  62.38
Table 7: English results comparisons using our models. * uses a single answer annotation.

6 Discussion

It is worth discussing the quality of context paragraphs in MLQA. Our parallel sentence mining approach can source independently-written documents in different languages, but, in practice, articles are often translated from English to the target languages by volunteers. Thus our method often acts as an efficient mechanism of sourcing existing human translations, rather than sourcing independently-written content on the same topic. The use of machine translation is strongly discouraged by the Wikipedia community, but from examining edit histories of articles in MLQA, machine translation is occasionally used as a seed for an article, before being edited and added to by human authors.

Our annotation procedure restricts answers to come from specific sentences in the context. Despite being provided with several sentences of context, some annotators may be tempted to focus only on the aligned sentence and generate questions which require only a single sentence of context to answer. However, single-sentence context questions are a known issue with SQuAD-style annotation protocols in general (Sugawara et al., 2018), suggesting our annotation protocol would not result in less challenging questions. This is borne out in the performance on MLQA-en being similar to SQuAD v1.1 (Section 5.3).

MLQA is partitioned into development and test splits. Due to its parallel nature, this means there is development data for every language. As MLQA will be freely available, this was done to reduce the risk of over-fitting to the test data in future, and to establish standard splits for future work. However, in our experiments, we only make use of the English MLQA development data and study strict zero-shot settings. Many other evaluation setups could be envisioned, e.g. exploiting the target language development sets for hyper-parameter optimisation or fine-tuning, which could be a fruitful direction for higher transfer performance, but we leave such “few-shot” experiments as future work. Other potential avenues to explore involve training with corpora and languages other than SQuAD and English, such as CMRC (Cui et al., 2018), or using unsupervised QA techniques to assist transfer to target languages (Lewis et al., 2019).

There is a large body of evidence to suggest that extractive QA models are over-reliant on word-matching between the question and context (Jia and Liang, 2017; Gan and Ng, 2019). The G-XLT task thus represents an interesting test-bed: because questions and contexts are in different languages, simple symbolic word matching is much less straightforward. However, performance on G-XLT is relatively strong, especially if either the question or the context is in English. This suggests that the word-matching behaviour between context and question in cross-lingual models is more nuanced and sophisticated than it would initially appear, and that the models' bigger weakness is in accurately modelling spans in the language of the context document.

7 Conclusion

In this work we have introduced MLQA, a highly-parallel multi-lingual question answering benchmark in seven languages. We developed several baselines on two cross-lingual understanding tasks on MLQA based on state-of-the-art methods, and demonstrated significant room for improvement. We hope that MLQA will help to catalyse work in cross-lingual QA in order to close the gap between training and testing language performance.

8 Acknowledgements

The authors would like to acknowledge their crowd-working and translation colleagues for their work on MLQA. The authors would also like to thank Yuxiang Wu, Andres Compara Nuñez, Kartikay Khandelwal, Nikhil Gupta, Chau Tran, Ahmed Kishky, Haoran Li, Tamar Lavee, and Ves Stoyanov for their feedback and comments.


References

  • A. Aly, K. Lakhotia, S. Zhao, M. Mohit, B. Oguz, A. Arora, S. Gupta, C. Dewan, S. Nelson-Lindall, and R. Shah (2018) PyText: a seamless path from NLP research to production. arXiv preprint arXiv:1812.08729.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining.
  • D. D. Lewis, Y. Yang, T. G. Rose, and F. Li (2004) RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research 5, pp. 361–397.
  • S. Lim, M. Kim, and J. Lee (2019) KorQuAD1.0: Korean QA dataset for machine reading comprehension. arXiv:1909.07005v2 [cs.CL].
  • H. Schwenk, V. Chaudhary, S. Sun, H. Gong, and F. Guzmán (2019) WikiMatrix: mining 135M parallel sentences in 1620 language pairs from Wikipedia.
  • H. Schwenk and X. Li (2018) A corpus for multilingual document classification in eight languages. In LREC, pp. 3548–3551.

Supplementary Materials for
MLQA: Evaluating Cross-lingual Extractive Question Answering

Appendix A Appendices

A.1 Additional MLQA Statistics

Figure 5 shows the distribution of wh-words in questions in both MLQA-en and SQuAD v1.1. The distributions are very similar, suggesting that SQuAD is an appropriate choice of training dataset.



Figure 5: Question type distribution (by “wh” word) in MLQA-en and SQuAD v1.1.

Table 8 shows the number of Wikipedia articles that feature at least one of their paragraphs as a context paragraph in MLQA, along with the number of unique context paragraphs in MLQA. There are 1.9 context paragraphs from each article on average. This is in contrast to SQuAD, which instead features a small number of curated articles that are much more densely annotated, with 43 context paragraphs per article on average. Thus, MLQA covers a much broader range of topics than SQuAD.

en ar de es hi zh vi
# Articles 5530 2627 2806 2762 2255 2673 2682
# Contexts 10894 5085 4509 5215 4524 4989 5246
# Instances 12738 5852 5029 5753 5425 4989 6006
Table 8: Number of Wikipedia articles with a context in MLQA.

A.2 Further details on Parallel Sentence mining

Table 9 shows the number of mined parallel sentences found in each language, as a function of how many languages the sentences are parallel between. As the number of languages that a parallel sentence is shared between increases, the number of such sentences decreases. When we look for 7-way aligned examples, we find only 1340 sentences across all 7 Wikipedias. Additionally, most of these sentences are either the first sentence of the article or otherwise uninteresting. However, if we choose 4-way parallel sentences, there are plenty of sentences to choose from.

N-way en de es ar zh vi hi
2 12219436 3925542 4957438 1047977 1174359 904037 210083
3 2143675 1157009 1532811 427609 603938 482488 83495
4 385396 249022 319902 148348 223513 181353 34050
5 73918 56756 67383 44684 58814 54884 13151
6 12333 11171 11935 11081 11485 11507 4486
7 1340 1340 1340 1340 1340 1340 1340
Table 9: Number of mined parallel sentences as a function of how many languages the sentences are parallel between
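The kind of tally reported in Table 9 can be sketched as below. This is an illustration only: it assumes mined sentences have already been merged into aligned groups mapping a language code to a sentence ID, and it counts groups covering exactly N languages, whereas the table may equally count "at least N". The group data and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical aligned groups: each maps a language code to one sentence ID.
groups = [
    {"en": "en-1", "de": "de-7"},
    {"en": "en-2", "de": "de-9", "es": "es-4"},
    {"en": "en-3", "es": "es-5", "zh": "zh-2"},
    {"en": "en-4", "de": "de-3", "es": "es-8", "ar": "ar-1"},
]


def count_n_way(aligned_groups):
    """Return counts[n][lang]: sentences in `lang` that belong to an exactly n-way group."""
    counts = defaultdict(lambda: defaultdict(int))
    for group in aligned_groups:
        n = len(group)  # number of languages this sentence is parallel between
        for lang in group:
            counts[n][lang] += 1
    return counts


counts = count_n_way(groups)
```

In a fully N-way group every language contributes exactly one sentence, which is consistent with the 7-way row of Table 9 showing the same count for all seven languages.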