Accuracy of the Uzbek stop words detection: a case study on "School corpus"

  • 2022-09-15 06:14:31
  • Khabibulla Madatov, Shukurla Bekchanov, Jernej Vičič


Stop words are very important for information retrieval and text analysis tasks in natural language processing. This work presents a method for evaluating the quality of a list of stop words produced by automatic creation techniques. Although the method proposed in this paper was tested on an automatically generated list of stop words for the Uzbek language, it can, with some modifications, be applied to similar languages, either from the same family or ones that are agglutinative in nature. Since Uzbek is an agglutinative language, automatic detection of stop words in it is a more complex process than in inflected languages. Moreover, we build on our previous work on stop word detection, using the "School corpus" as an example, by investigating how to automatically analyse the detection of stop words in Uzbek texts. This work is devoted to answering whether there is a good way of evaluating available stop words for Uzbek texts, or whether it is possible to determine which part of an Uzbek sentence contains the majority of the stop words by studying the numerical characteristics of the probability of unique words. The results show acceptable accuracy of the stop word lists.
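As a rough illustration of what evaluating a stop word list can look like, the sketch below compares an automatically generated list against a manually curated reference list and reports precision, recall, and F1. This is a generic evaluation under stated assumptions, not necessarily the procedure used in the paper: the function name `evaluate_stop_word_list` is hypothetical, and the small Uzbek word lists are illustrative only and are not drawn from the "School corpus".

```python
def evaluate_stop_word_list(candidate, reference):
    """Compare a candidate stop word list against a reference list.

    Both arguments are iterables of strings. Words are lowercased and
    de-duplicated before comparison. Returns precision, recall, and F1.
    """
    candidate_set = {word.lower() for word in candidate}
    reference_set = {word.lower() for word in reference}

    # Words that appear in both lists count as correctly detected.
    true_positives = candidate_set & reference_set

    precision = len(true_positives) / len(candidate_set) if candidate_set else 0.0
    recall = len(true_positives) / len(reference_set) if reference_set else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0

    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative data only (assumption): a hypothetical automatically
# generated list and a hypothetical hand-labelled reference list.
detected = ["va", "bilan", "uchun", "maktab"]
gold = ["va", "bilan", "uchun", "ham", "bu"]

scores = evaluate_stop_word_list(detected, gold)
print(scores)
```

In this toy case three of the four detected words are in the reference list, and three of the five reference words were detected, so precision is 0.75 and recall is 0.6. A real evaluation would depend on how the reference list is built, which this sketch does not address.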


