IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

  • 2023-05-25 18:57:43
  • AI4Bharat, Jay Gala, Pranjal A. Chitale, Raghavan AK, Sumanth Doddapaneni, Varun Gumma, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages, listed in the Constitution of India (and referred to as scheduled languages), are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all the 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at


