BongLLaMA: LLaMA for Bangla Language

Abstract

Bangla (or "Bengali") is a language spoken by approximately 240 million native speakers and around 300 million people worldwide. Despite being the 5th largest spoken language in the world, Bangla is still a "low-resource" language, and existing pretrained language models often struggle to perform well on Bangla Language Processing (BLP) tasks. This work addresses this gap by introducing BongLLaMA (i.e., Bangla-LLaMA), an open-source large language model fine-tuned exclusively on large Bangla corpora and instruction-tuning datasets. We present our methodology, data augmentation techniques, fine-tuning details, and comprehensive benchmarking results showcasing the utility of BongLLaMA on BLP tasks. We believe BongLLaMA will serve as the new standard baseline for Bangla Language Models and, thus, facilitate future benchmarking studies focused on this widely-spoken yet "low-resource" language. All BongLLaMA models are available for public use at https://huggingface.co/BanglaLLM.
