WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

  • 2019-07-10 23:57:30
  • Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, Francisco Guzmán

Abstract

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 85 languages, including several dialects and low-resource languages. We do not limit the extraction process to alignments with English, but systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus of parallel sentences is freely available at https://github.com/facebookresearch/LASER/tasks/WikiMatrix. To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting for training MT systems between distant languages without the need to pivot through English.
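
As a rough illustration of embedding-based mining, the sketch below pairs sentences whose embeddings are mutual nearest neighbours under cosine similarity. It rests on assumptions that the abstract does not state: the function name `mine_pairs`, the `threshold` value, the mutual-nearest-neighbour rule, and the toy 3-dimensional vectors are all hypothetical, and the scoring criterion actually used in the paper may differ.

```python
import numpy as np


def mine_pairs(src_emb, tgt_emb, threshold=0.8):
    """Return (src_idx, tgt_idx, score) for mutual nearest neighbours.

    Hypothetical helper: plain cosine similarity plus a fixed threshold,
    chosen for illustration only.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)

    sim = src @ tgt.T                 # shape: (n_src, n_tgt)
    best_tgt = sim.argmax(axis=1)     # best target for each source sentence
    best_src = sim.argmax(axis=0)     # best source for each target sentence

    pairs = []
    for i, j in enumerate(best_tgt):
        # Keep a pair only if each side picks the other and the score is high.
        if best_src[j] == i and sim[i, j] >= threshold:
            pairs.append((i, int(j), float(sim[i, j])))
    return pairs


# Toy embeddings standing in for the output of a multilingual sentence encoder.
src = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
tgt = np.array([[0.0, 0.9, 0.1],
                [0.95, 0.05, 0.0],
                [0.5, 0.5, 0.4]])

pairs = mine_pairs(src, tgt)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

Because both sides are embedded in one shared space, the same routine applies to any language pair, which is what makes mining without pivoting through English possible in principle.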

 
