El Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks

Abstract

Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever-growing corpora of diverse language resources. However, less-resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case is to machine-translate English corpora into the target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing leads to improved performance and overall LM robustness. In the empirical evaluation, we perform a comparison of directly translated against curated Spanish SQuAD datasets on both user and system levels. Further experimental results on XQuAD and MLQA transfer-learning evaluation question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.
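
For readers unfamiliar with the exact match score mentioned above, the following is a minimal sketch of how it is commonly computed for SQuAD-style question answering. It is not taken from the paper: the function names are hypothetical, and the normalization (lowercasing, stripping ASCII punctuation and English articles) follows the widely used SQuAD v1.1 convention. A Spanish evaluation would presumably need its own article list and handling of characters such as "¿" and "¡", which this sketch does not cover.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace.

    Assumption: this mirrors the common SQuAD v1.1 normalization, not
    necessarily the exact procedure used in the paper.
    """
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))


def corpus_exact_match(predictions: list[str], references: list[list[str]]) -> float:
    """Average exact match over a dataset, reported as a percentage."""
    scores = [exact_match(p, golds) for p, golds in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)
```

Because the metric gives no partial credit, a translation artifact that shifts an answer span by a single token turns a correct prediction into a miss, which is one reason it is a sensitive indicator of corpus quality.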
