Novel Language Resources for Hindi: An Aesthetics Text Corpus and a Comprehensive Stop Lemma List

Abstract

This paper is an effort to complement the contributions made by researchers working toward the inclusion of non-English languages in natural language processing studies. Two novel Hindi language resources have been created and released for public consumption. The first resource is a corpus consisting of nearly a thousand pre-processed fictional and nonfictional texts spanning over a hundred years. The second resource is an exhaustive list of stop lemmas created from two kinds of sources: 12 corpora across multiple domains, consisting of over 13 million words, from which more than 200,000 lemmas were generated, and 11 publicly available stop word lists comprising over 1,000 words, from which nearly 400 unique lemmas were generated. This research emphasizes the use of stop lemmas instead of stop words because stop word lists contain various, but not all, morphological forms of a word, whereas a stop lemma list contains only the root form of the word, from which variations can be derived if required. It was also observed that stop lemmas were more consistent across multiple sources than stop words. To generate the stop lemma list, the parts of speech of the lemmas were investigated as a possible criterion but rejected, as no significant correlation was found between the rank of a word in the frequency list and its part of speech. The stop lemma list was assessed using a comparative method. A formal evaluation method is suggested as future work arising from this study.
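
The claim that stop lemmas are more consistent across sources than stop words can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny lemma table `LEMMA_OF`, the two toy word lists, and the use of Jaccard overlap as the consistency measure are hypothetical and are not taken from the paper, which does not specify its lemmatizer or its comparison metric here.

```python
# Hypothetical toy lemma table: maps inflected Hindi forms to a root form.
# A real pipeline would use a Hindi lemmatizer; this table is an assumption.
LEMMA_OF = {
    "का": "का", "की": "का", "के": "का",
    "है": "होना", "हैं": "होना", "था": "होना", "थी": "होना", "थे": "होना",
    "किया": "करना", "करते": "करना", "करना": "करना",
}


def to_lemmas(words):
    """Collapse a stop word list into its set of lemmas.

    Words missing from the toy table are kept unchanged.
    """
    return {LEMMA_OF.get(w, w) for w in words}


def jaccard(a, b):
    """Jaccard overlap of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two hypothetical stop word lists that cover the same function words
# but happen to include different morphological forms of them.
list_a = ["का", "की", "है", "था", "किया"]
list_b = ["के", "हैं", "थी", "थे", "करते"]

# Overlap measured on surface forms versus overlap measured on lemmas.
word_overlap = jaccard(list_a, list_b)
lemma_overlap = jaccard(to_lemmas(list_a), to_lemmas(list_b))

print(f"stop word overlap:  {word_overlap:.2f}")
print(f"stop lemma overlap: {lemma_overlap:.2f}")
```

In this toy case the two lists share no surface forms, yet they reduce to the same three lemmas, which is the kind of cross-source agreement the abstract attributes to stop lemmas.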
