Abstract
Natural Language Understanding (NLU) for low-resource languages remains a major challenge in NLP due to the scarcity of high-quality data and language-specific models. Maithili, despite being spoken by millions, lacks adequate computational resources, limiting its inclusion in digital and AI-driven applications. To address this gap, we introduce maiBERT, a BERT-based language model pre-trained specifically for Maithili using the Masked Language Modeling (MLM) technique. Our model is trained on a newly constructed Maithili corpus and evaluated through a news classification task. In our experiments, maiBERT achieved an accuracy of 87.02%, outperforming existing regional models like NepBERTa and HindiBERT, with a 0.13% overall accuracy gain and 5-7% improvement across various classes. We have open-sourced maiBERT on Hugging Face, enabling further fine-tuning for downstream tasks such as sentiment analysis and Named Entity Recognition (NER).