BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding

Abstract

Pre-training language models on large volumes of data with self-supervised objectives has become standard practice in natural language processing. However, most such state-of-the-art models are available only in English and other resource-rich languages. Even in multilingual models, which are trained on hundreds of languages, low-resource languages remain underrepresented. Bangla, the seventh most widely spoken language in the world, is still low in resources. Few downstream task datasets for language understanding in Bangla are publicly available, and there is a clear shortage of good-quality data for pre-training. In this work, we build a Bangla natural language understanding model pre-trained on 18.6 GB of data we crawled from top Bangla sites on the internet. We introduce a new downstream task dataset and benchmark on four tasks: sentence classification, document classification, natural language understanding, and sequence tagging. Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%. In the process, we identify a major shortcoming of multilingual models that hurts performance for low-resource languages that do not share a writing script with any high-resource language, which we name the 'Embedding Barrier'. We perform extensive experiments to study this barrier. We release all our datasets and pre-trained models to aid future NLP research on Bangla and other low-resource languages. Our code and data are available at https://github.com/csebuetnlp/banglabert.
