A New Corpus for Low-Resourced Sindhi Language with Word Embeddings

Abstract

Representing words and phrases as dense real-valued vectors that encode semantic and syntactic properties is a vital component of natural language processing (NLP). The success of neural network (NN) models in NLP relies largely on such dense word representations learned from large unlabeled corpora. Sindhi is a morphologically rich language spoken by a large population in Pakistan and India, yet it lacks corpora, which serve as an essential test-bed for generating word embeddings and developing language-independent NLP systems. In this paper, a large corpus of more than 61 million words is developed for the low-resourced Sindhi language for training neural word embeddings. The corpus is acquired from multiple web resources using web-scrappy. Because no open-source preprocessing tools are available for Sindhi, preprocessing such a large corpus is a challenging problem, especially cleaning the noisy data extracted from web resources. Therefore, a preprocessing pipeline is employed to filter noisy text. Afterwards, the cleaned vocabulary is used to train Sindhi word embeddings with the state-of-the-art GloVe algorithm and the Skip-Gram (SG) and Continuous Bag of Words (CBoW) word2vec algorithms. Intrinsic evaluation with a cosine similarity matrix and WordSim-353 is employed to assess the generated Sindhi word embeddings. Moreover, we compare the proposed word embeddings with the recently released Sindhi fastText (SdfastText) word representations. Our intrinsic evaluation results demonstrate the high quality of our generated Sindhi word embeddings using SG, CBoW, and GloVe compared to the SdfastText word representations.
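
As a rough illustration of the cosine similarity matrix mentioned above, the following minimal sketch computes pairwise cosine similarities over a few toy vectors. The names `cosine_similarity`, `similarity_matrix`, and `toy_vectors`, as well as the vector values, are assumptions made for illustration only; they are not the paper's code or data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Return a nested dict: matrix[w1][w2] = cosine similarity of w1 and w2."""
    words = list(vectors)
    return {
        w1: {w2: cosine_similarity(vectors[w1], vectors[w2]) for w2 in words}
        for w1 in words
    }

# Hypothetical 3-dimensional "embeddings"; real embeddings would be learned
# from the corpus and have far more dimensions.
toy_vectors = {
    "word_a": [1.0, 2.0, 0.0],
    "word_b": [2.0, 4.0, 0.0],  # same direction as word_a
    "word_c": [0.0, 0.0, 3.0],  # orthogonal to word_a and word_b
}

matrix = similarity_matrix(toy_vectors)
```

In such a matrix, vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, which is the property intrinsic evaluation uses to check whether related words lie near each other.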
