Abstract
While topic modeling for English is a prevalent and well-explored area, topic modeling for Indic languages remains relatively rare. The limited availability of resources, diverse linguistic structures, and unique challenges posed by Indic languages contribute to the scarcity of research and applications in this domain. Despite the growing interest in natural language processing and machine learning, there is a noticeable gap in the comprehensive exploration of topic modeling methodologies tailored specifically to languages such as Hindi, Marathi, Tamil, and others. In this paper, we examine several topic modeling approaches applied to the Marathi language. Specifically, we compare various BERT and non-BERT approaches, including multilingual and monolingual BERT models, using topic coherence and topic diversity as evaluation metrics. Our analysis provides insights into the performance of these approaches for Marathi topic modeling. The key finding of the paper is that BERTopic, when combined with BERT models trained on Indic languages, outperforms LDA in topic modeling performance.