LakotaBERT: A Transformer-based Model for Low Resource Lakota Language

  • 2025-03-23 22:31:12
  • Kanishka Parankusham, Rodrigue Rizk, KC Santosh

Abstract

Lakota, a critically endangered language of the Sioux people in North America, faces significant challenges due to declining fluency among younger generations. This paper introduces LakotaBERT, the first large language model (LLM) tailored for Lakota, aiming to support language revitalization efforts. Our research has two primary objectives: (1) to create a comprehensive Lakota language corpus and (2) to develop a customized LLM for Lakota. We compiled a diverse corpus of 105K sentences in Lakota, English, and parallel texts from various sources, such as books and websites, emphasizing the cultural significance and historical context of the Lakota language. Utilizing the RoBERTa architecture, we pre-trained our model and conducted comparative evaluations against established models such as RoBERTa, BERT, and multilingual BERT. Initial results demonstrate a masked language modeling accuracy of 51% with a single ground truth assumption, showcasing performance comparable to that of English-based models. We also evaluated the model using additional metrics, such as precision and F1 score, to provide a comprehensive assessment of its capabilities. By integrating AI and linguistic methodologies, we aspire to enhance linguistic diversity and cultural resilience, setting a valuable precedent for leveraging technology in the revitalization of other endangered indigenous languages.

 
