BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

Abstract

In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed 'Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.
