Stemming -- The Evolution and Current State with a Focus on Bangla

  • 2025-08-21 16:54:24
  • Abhijit Paul, Mashiat Amin Farin, Sharif Md. Abdullah, Ahmedul Kabir, Zarif Masud, Shebuti Rayana

Abstract

Bangla, the seventh most widely spoken language worldwide with 300 million native speakers, faces digital under-representation due to limited resources and a lack of annotated datasets. Stemming, a critical preprocessing step in language analysis, is essential for low-resource, highly inflectional languages like Bangla because it can reduce the complexity of algorithms and models by significantly reducing the number of distinct word forms they need to consider. This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively. A review of the landscape of Bangla stemming reveals a significant gap in the existing literature. The paper highlights the discontinuity from previous research and the scarcity of accessible implementations for replication. Furthermore, it critiques the evaluation methodologies, stressing the need for more relevant metrics. The paper also acknowledges the challenges posed by Bangla's rich morphology and diverse dialects. To address these challenges, it suggests directions for Bangla stemmer development. It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.

 
