IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages

Abstract

In this paper we present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. Unlike existing pre-trained models, IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on NMT for 12 language pairs and extreme summarization for 7 languages using multilingual fine-tuning show that IndicBART is competitive with or better than mBART50 despite containing significantly fewer parameters. Our analyses focus on identifying the impact of script unification (to Devanagari), corpus size, and multilingualism on the final performance. The IndicBART model is available under the MIT license at https://indicnlp.ai4bharat.org/indic-bart.
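To make the idea of script unification concrete, the following is a minimal sketch of mapping text from another Indic script to Devanagari. It is not the authors' implementation. It relies on the assumption that the Unicode blocks for the major Indic scripts are laid out in parallel, 128 code points each, so that a character can be mapped by shifting its code point into the Devanagari block. The function name `to_devanagari` and the table `SCRIPT_BLOCK_STARTS` are hypothetical names introduced for illustration, and the mapping is approximate because some characters have no counterpart in Devanagari.

```python
# Sketch only: offset-based script unification, assuming the parallel layout
# of Unicode Indic blocks. Names below are illustrative, not from the paper.

DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80  # each Indic script block spans 128 code points

# Start of the Unicode block for each source script.
SCRIPT_BLOCK_STARTS = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def to_devanagari(text: str, script: str) -> str:
    """Shift characters of `script` into the Devanagari block.

    Characters outside the source script's block (spaces, digits,
    Latin letters, punctuation) are left unchanged.
    """
    start = SCRIPT_BLOCK_STARTS[script]
    converted = []
    for ch in text:
        code_point = ord(ch)
        if start <= code_point < start + BLOCK_SIZE:
            converted.append(chr(code_point - start + DEVANAGARI_START))
        else:
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    # Bengali KA (U+0995) and Tamil KA (U+0B95) both map to Devanagari KA (U+0915).
    print(to_devanagari("\u0995", "bengali") == "\u0915")
    print(to_devanagari("\u0B95", "tamil") == "\u0915")
```

After such a mapping, related words written in different scripts share the same characters, which is the kind of overlap a shared subword vocabulary can exploit.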
