Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Abstract

We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 46.9 million sentence pairs between English and 11 Indic languages (from two language families). In particular, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and we additionally mine 34.6 million sentence pairs from the web, resulting in a 2.8X increase in publicly available sentence pairs. We mine the parallel sentences from the web by combining many corpora, tools, and methods. In particular, we use (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 language pairs. Further, we extracted 82.7 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar and compared them with other baselines and previously reported results on publicly available benchmarks. Our models outperform existing models on these benchmarks, establishing the utility of Samanantar. Our data (https://indicnlp.ai4bharat.org/samanantar) and models (https://github.com/AI4Bharat/IndicTrans) will be available publicly, and we hope they will help advance research in Indic NMT and multilingual NLP for Indic languages.
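
The pivoting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that two Indic sentences are paired whenever they share an identical English side; the abstract does not state the exact matching rule, so this is an illustration rather than the authors' procedure. The function name `pivot_pairs`, the language codes, and the placeholder sentences are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pivot_pairs(en_centric):
    """Derive X-Y sentence pairs from English-X and English-Y pairs.

    en_centric maps a language code to a list of (english, target) pairs.
    Assumption: two target sentences are paired only when their English
    sides are exactly identical strings.
    """
    # Index every language's target sentences by their English side.
    by_english = {lang: defaultdict(list) for lang in en_centric}
    for lang, pairs in en_centric.items():
        for en, tgt in pairs:
            by_english[lang][en].append(tgt)

    mined = {}
    # Visit every unordered language pair once (11 languages give 55 pairs).
    for a, b in combinations(sorted(en_centric), 2):
        shared = by_english[a].keys() & by_english[b].keys()
        mined[(a, b)] = [
            (sent_a, sent_b)
            for en in sorted(shared)
            for sent_a in by_english[a][en]
            for sent_b in by_english[b][en]
        ]
    return mined


# Toy English-centric data with placeholder target sentences.
toy = {
    "hi": [("Good morning.", "<hi-1>"), ("Thank you.", "<hi-2>")],
    "ta": [("Good morning.", "<ta-1>"), ("See you soon.", "<ta-3>")],
    "bn": [("Thank you.", "<bn-2>"), ("Good morning.", "<bn-1>")],
}
result = pivot_pairs(toy)
```

With three languages the sketch yields three language pairs; with 11 it would yield the 55 pairs cited above. Exact string matching is the simplest possible pivot rule, and a real pipeline might normalize or deduplicate the English side first.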
