Abstract
- The field of natural language processing (NLP) has expanded dramatically within the last decade. Many everyday applications rely on NLP tasks such as machine translation, speech recognition, text generation and recommendation, Part-of-Speech (POS) tagging, and Named-Entity Recognition (NER). However, low-resourced languages, such as the Central-Kurdish language (CKL), remain largely unexamined due to a shortage of the resources necessary to support their development. POS tagging is the basis of other NLP tasks; for example, POS tagsets have been used to standardize languages by describing the relationships between words within sentences, which in turn supports machine translation and text recommendation. For the CKL specifically, most of the POS tagsets used or provided so far are neither standardized nor comprehensive. To this end, this study presents an accurate and comprehensive POS tagset for the CKL to improve the performance of Kurdish NLP tasks. The article also collects most of the POS tags from different studies, as well as from Kurdish linguistic experts, to standardize part-of-speech tags. The proposed POS tagset is designed to annotate a large CKL corpus and support Kurdish NLP tasks. Initial investigations in this study, via comparison with the Universal Dependencies framework for standard languages, show that the proposed POS tagset can streamline or correct sentences more accurately for Kurdish NLP tasks.