Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Abstract

We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and additionally mine 37.4 million sentence pairs from the web, resulting in a 4x increase. We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance research in NMT and multilingual NLP for Indic languages.
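
To make the pivoting step concrete, the following is a minimal sketch of how sentence pairs between two Indic languages could be derived from English-centric data. It assumes that two translations are paired whenever they share an identical English sentence; the abstract does not specify the matching criterion, so exact string matching is an assumption here. The function name `pivot_pairs`, the toy data, and the placeholder strings such as `HI-1` are hypothetical and stand in for real sentences. The sketch also checks that 11 languages give 55 unordered language pairs.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def pivot_pairs(english_centric):
    """Derive non-English sentence pairs by pivoting through English.

    english_centric maps a language code to a list of
    (english_sentence, translation) tuples.
    Assumption: two translations are paired only when their English
    sides are exactly identical strings.
    """
    # Group translations by the English sentence they are aligned to.
    by_english = defaultdict(dict)
    for lang, pairs in english_centric.items():
        for english, translation in pairs:
            # Keep the first translation seen per language and sentence.
            by_english[english].setdefault(lang, translation)

    # Every two languages that share an English sentence yield one pair.
    pivoted = defaultdict(list)
    for translations in by_english.values():
        for lang_a, lang_b in combinations(sorted(translations), 2):
            pivoted[(lang_a, lang_b)].append(
                (translations[lang_a], translations[lang_b])
            )
    return dict(pivoted)


# Toy data: placeholder strings stand in for real sentences.
toy = {
    "hi": [("Hello.", "HI-1"), ("Thank you.", "HI-2")],
    "ta": [("Hello.", "TA-1")],
    "bn": [("Hello.", "BN-1"), ("Thank you.", "BN-2")],
}

result = pivot_pairs(toy)
num_language_pairs = comb(11, 2)  # unordered pairs among 11 languages
```

In this toy run, `result[("bn", "hi")]` holds two pairs because both languages cover both English sentences, while pairs involving `ta` are limited to the single shared sentence.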
