MLQA: Evaluating Cross-lingual Extractive Question Answering

Abstract

Question answering (QA) models have shown rapid progress enabled by the availability of large, high-quality benchmark datasets. Such annotated datasets are difficult and costly to collect, and rarely exist in languages other than English, making training QA systems in other languages challenging. An alternative to building large monolingual training datasets is to develop cross-lingual systems which can transfer to a target language without requiring training data in that language. In order to develop such systems, it is crucial to invest in high quality multilingual evaluation benchmarks to measure progress. We present MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area. MLQA contains QA instances in 7 languages, namely English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. It consists of over 12K QA instances in English and 5K in each other language, with each QA instance being parallel between 4 languages on average. MLQA is built using a novel alignment context strategy on Wikipedia articles, and serves as a cross-lingual extension to existing extractive QA datasets. We evaluate current state-of-the-art cross-lingual representations on MLQA, and also provide machine-translation-based baselines. In all cases, transfer results are shown to be significantly behind training-language performance.
