Can maiBERT Speak for Maithili?

  • 2025-09-22 16:46:34
  • Sumit Yadav, Raju Kumar Yadav, Utsav Maskey, Gautam Siddharth Kashyap, Md Azizul Hoque, Ganesh Gautam

Abstract

Natural Language Understanding (NLU) for low-resource languages remains a major challenge in NLP due to the scarcity of high-quality data and language-specific models. Maithili, despite being spoken by millions, lacks adequate computational resources, limiting its inclusion in digital and AI-driven applications. To address this gap, we introduce maiBERT, a BERT-based language model pre-trained specifically for Maithili using the Masked Language Modeling (MLM) technique. Our model is trained on a newly constructed Maithili corpus and evaluated through a news classification task. In our experiments, maiBERT achieved an accuracy of 87.02%, outperforming existing regional models like NepBERTa and HindiBERT, with a 0.13% overall accuracy gain and 5-7% improvement across various classes. We have open-sourced maiBERT on Hugging Face, enabling further fine-tuning for downstream tasks such as sentiment analysis and Named Entity Recognition (NER).

 
