A New Corpus for Low-Resourced Sindhi Language with Word Embeddings
Representing words and phrases as dense vectors of real numbers that encode semantic and syntactic properties is a vital constituent of natural language processing (NLP). The success of neural network (NN) models in NLP largely relies on such dense word representations learned from large unlabeled corpora. Sindhi is a morphologically rich language spoken by a large population in Pakistan and India, yet it lacks the corpora that serve as a test-bed for generating word embeddings and developing language-independent NLP systems. In this paper, a large corpus of more than 61 million words is developed for the low-resourced Sindhi language for training neural word embeddings. The corpus is acquired from multiple web resources using the Scrapy web-scraping framework. Due to the unavailability of open-source preprocessing tools for Sindhi, the preprocessing of such a large corpus is a challenging problem, especially the cleaning of noisy data extracted from web resources. Therefore, a preprocessing pipeline is employed for the filtration of noisy text. Afterwards, the cleaned vocabulary is utilized for training Sindhi word embeddings with the state-of-the-art GloVe, Skip-Gram (SG), and Continuous Bag of Words (CBoW) word2vec algorithms. The intrinsic evaluation approaches of a cosine similarity matrix and WordSim-353 are employed for the evaluation of the generated Sindhi word embeddings. Moreover, we compare the proposed word embeddings with the recently released Sindhi fastText (SdfastText) word representations. Our intrinsic evaluation results demonstrate the high quality of our generated Sindhi word embeddings using SG, CBoW, and GloVe compared to the SdfastText word representations.
Wazir Ali, Jay Kumar, Junyu Lu, Zenglin Xu
School of Computer Science and Engineering, University of Electronic Science and Technology of China
December 2, 2019
Sindhi is a morphologically rich, multiscript, and multidialectal language. It belongs to the Indo-Aryan language family, with a significant cultural and historical background. Presently, it is recognized as an official language in the Sindh province of Pakistan and is taught as a compulsory subject in schools and colleges. Sindhi is also recognized as one of the national languages of India. Ulhasnagar, Rajasthan, Gujarat, and Maharashtra are the largest Indian regions of native Sindhi speakers. It is also spoken in countries other than Pakistan and India to which native Sindhi speakers have migrated, such as America, Canada, Hong Kong, Britain, Singapore, Tanzania, the Philippines, Kenya, Uganda, and South and East Africa. Sindhi has a rich morphological structure due to a large number of homogeneous words. Historically, it was written in multiple writing systems, which differ from each other in terms of orthography and morphology. Persian-Arabic is the standard script of Sindhi, which was officially accepted in 1852 by the British government (https://www.britannica.com/topic/Sindhi-language). However, Sindhi-Devanagari is also a popular writing system in India, written from left to right like Hindi. Formerly, Khudabadi, Gujrati, Landa, Khojki, and Gurumukhi were also adopted as its writing systems. Sindhi has a great historical and literary background and is presently spoken by nearly 75 million people. Research on Sindhi NLP (SNLP) began in 2002 (see ”Sindhia lai Kampyutar jo Istemalu” (Use of computer for Sindhi), an article published in Sindhu yearly, Ulhasnagar, 2002); however, it attracted research attention only after the development of its Unicode system. Still, Sindhi stands among the low-resourced languages due to the scarcity of core language processing resources such as raw and annotated corpora, which can be utilized for training robust word embeddings or for machine learning algorithms. Moreover, the development of annotated datasets requires time and human resources.
Language Resources (LRs) are fundamental elements for the development of high-quality NLP systems based on automatic or NN-based approaches. LRs include written or spoken corpora, lexicons, and annotated corpora for specific computational purposes. The development of such resources has received great research interest for the digitization of human languages. Many world languages, including English, Chinese, and others, are rich in such language processing resources integrated into their software tools. The Sindhi language lacks the basic computational resource of a large text corpus, which can be utilized for training robust word embeddings and developing language-independent NLP applications including semantic analysis, sentiment analysis, part-of-speech tagging, named entity recognition, machine translation, and multitasking. Presently, Sindhi Persian-Arabic is frequently used for online communication, newspapers, and public institutions in Pakistan and India. But little work has been carried out on the development of LRs such as raw and annotated corpora. To the best of our knowledge, Sindhi lacks a large unlabelled corpus that can be utilized for generating and evaluating word embeddings for Statistical Sindhi Language Processing (SSLP).
One way to break out of this loop is to learn word embeddings from unlabelled corpora, which can be utilized to bootstrap other downstream NLP tasks. Word embedding is a newer term for semantic vector spaces, distributed representations, and distributed semantic models. It is a language modeling approach used for mapping words and phrases into $D$-dimensional dense vectors of real numbers that effectively capture the semantic and syntactic relationships with neighboring words in a geometric way. For example, “Einstein” and “scientist” would have greater similarity than “Einstein” and “doctor.” In this way, word embeddings realize the important linguistic concept that “a word is characterized by the company it keeps”. More recently, NN-based models have yielded state-of-the-art performance in multiple NLP tasks with word embeddings. One of the advantages of such techniques is that they use unsupervised approaches for learning representations and do not require an annotated corpus, which is rare for the low-resourced Sindhi language. Such representations can be trained on large unannotated corpora, and the generated representations can then be used in NLP tasks that have only a small amount of labelled data.
In this paper, we address the problem of corpus construction by collecting a large corpus of more than 61 million words from multiple web resources using the Scrapy framework. After collecting the corpus, we carefully preprocessed it to filter out noisy text, e.g., HTML tags and English vocabulary. A statistical analysis is also presented for letter and word frequencies and for the identification of stop words. Finally, the corpus is utilized to generate Sindhi word embeddings using the state-of-the-art GloVe, SG, and CBoW algorithms. The popular intrinsic evaluation methods of calculating cosine similarity between word vectors and WordSim353 are employed to measure the performance of the learned Sindhi word embeddings. We translated the English WordSim353 word pairs (available online at https://rdrr.io/cran/wordspace/man/WordSim353.html) into Sindhi using a bilingual English-to-Sindhi dictionary. The intrinsic approach typically involves a pre-selected set of query terms and semantically related target words, which we refer to as query words. Furthermore, we also compare the proposed word embeddings with the recently released Sindhi fastText word representations, which we denote as SdfastText (available at https://fasttext.cc/docs/en/crawl-vectors.html and trained on the Common Crawl and Wikipedia corpus of Sindhi Persian-Arabic). To the best of our knowledge, this is the first comprehensive work on the development of a large corpus and the generation of word embeddings, along with systematic evaluation, for low-resourced Sindhi Persian-Arabic. Our novel contributions are listed as follows:
We present a large corpus of more than 61 million words obtained from multiple web resources and reveal a list of Sindhi stop words.
We develop a text cleaning pipeline for the preprocessing of the raw corpus.
We generate word embeddings using the GloVe, CBoW, and SG word2vec algorithms, and evaluate and compare them using the intrinsic evaluation approaches of the cosine similarity matrix and WordSim353.
We are the first to evaluate SdfastText word representations and compare them with our proposed Sindhi word embeddings.
The remaining sections of the paper are organized as follows: Section 1 presents the literature survey regarding computational resources, Sindhi corpus construction, and word embedding models. Section 2 presents the employed methodology, and Section 3 consists of the statistical analysis of the developed corpus. Section 4 presents the experimental setup. The intrinsic evaluation results, along with comparisons, are given in Section 5. The discussion and future work are given in Section 6, and lastly, Section 7 presents the conclusion.
1 Related work
Natural language resources refer to a set of language data and descriptions in machine-readable form, used for building, improving, and evaluating NLP algorithms or software. Such resources include written or spoken corpora, lexicons, and annotated corpora for specific computational purposes. Many world languages are rich in such language processing resources integrated into software tools, including NLTK for English, Stanford CoreNLP, LTP for Chinese, TectoMT for German, Russian, and Arabic, and multilingual toolkits. But the Sindhi language is at an early stage in the development of such resources and software tools.
Corpus construction for NLP mainly involves the important steps of acquisition, preprocessing, and tokenization. An initial study discussed the morphological structure and the challenges of corpus development, along with orthographical and morphological features of the Persian-Arabic script. The raw and annotated corpus for Sindhi Persian-Arabic is a good supplement towards the development of resources, including raw and annotated datasets for part-of-speech tagging, morphological analysis, transliteration between Sindhi Persian-Arabic and Sindhi-Devanagari, and a machine translation system. But that corpus is acquired only from Wikipedia dumps. A survey-based study summarizes the progress made in Sindhi Natural Language Processing (SNLP), with a complete gist of the adopted techniques, developed tools, and available resources, and shows that resource development for Sindhi needs more sophisticated efforts. The raw corpus has also been utilized for Sindhi word segmentation. More recently, an initiative towards the development of resources was taken by open-sourcing an annotated dataset of Sindhi Persian-Arabic obtained from news and social blogs. The existing and proposed work on corpus development, word segmentation, and word embeddings is presented in Table 1.
The power of word embeddings in NLP was empirically estimated by proposing a neural language model and multitask learning, but recently the usage of word embeddings in deep neural algorithms has become an integral element for performance acceleration in deep NLP applications. The popular CBoW and SG word2vec neural architectures yielded high-quality vector representations, in terms of semantic and syntactic word similarity, at lower computational cost, and were later extended with the integration of character-level learning on large corpora. Both approaches produce state-of-the-art accuracy with fast training performance, better representations of less frequent words, and efficient representation of phrases as well. An NN-based approach was also proposed for generating morphemic-level word embeddings, which surpassed all the existing embedding models in intrinsic evaluation. The count-based GloVe model also yielded state-of-the-art results in intrinsic evaluation and downstream NLP tasks.
The performance of word embeddings is evaluated using intrinsic and extrinsic evaluation methods. The intrinsic approach is used to measure the internal quality of word embeddings, for example by querying nearest neighboring words and calculating the semantic or syntactic similarity between similar word pairs. A method of direct comparison for intrinsic evaluation of word embeddings measures the neighborhood of a query word in vector space. The key advantage of that method is that it reduces bias and gives insight into data-driven relevance judgments. An extrinsic evaluation approach is used to evaluate the performance in downstream NLP tasks, such as part-of-speech tagging or named-entity recognition, but the Sindhi language lacks an annotated corpus for this type of evaluation. Moreover, extrinsic evaluation is time consuming and difficult to interpret. Therefore, we opt for the intrinsic evaluation method to get a quick insight into the quality of the proposed Sindhi word embeddings by measuring the cosine distance between similar words and using the WordSim353 dataset. A study reveals that the choice of optimized hyperparameters has a greater impact on the quality of pretrained word embeddings than designing a novel algorithm. Therefore, we optimized the hyperparameters for generating robust Sindhi word embeddings using the CBoW, SG, and GloVe models. Embedding visualization is also useful for inspecting the similarity of word clusters. Therefore, we use the t-SNE dimensionality reduction algorithm, together with PCA, for compressing high-dimensional embeddings into two-dimensional coordinate pairs. PCA is useful for combining input features by dropping the least important features while retaining the most valuable ones.
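Table 1: Existing and proposed work on corpus development, word segmentation, and word embeddings.
|Work|Task|Resource|
|Existing work|Word embedding|Wiki-dumps (2016)|
|Existing work|Text corpus|4.1M tokens|
|Existing work|Corpus development|Wiki-dumps (2016)|
|Existing work|Sentiment analysis|31.5K tokens|
|Proposed work|Raw corpus|61.39M tokens|
|Proposed work|Word embeddings|61.39M tokens|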
2 Methodology
This section presents the employed methodology in detail for corpus acquisition, preprocessing, statistical analysis, and generating Sindhi word embeddings.
2.1 Task description
We initiate this work from scratch by collecting large corpus from multiple web resources. After preprocessing and statistical analysis of the corpus, we generate Sindhi word embeddings with state-of-the-art CBoW, SG, and GloVe algorithms. The generated word embeddings are evaluated using the intrinsic evaluation approaches of cosine similarity between nearest neighbors, word pairs, and WordSim-353 for distributional semantic similarity. Moreover, we use t-SNE with PCA for the comparison of the distance between similar words via visualization.
2.2 Corpus acquisition
A corpus is a collection of human language text built with a specific purpose. The statistical analysis of a corpus provides quantitative, reusable data and an opportunity to examine intuitions and ideas about language. Therefore, a corpus has great importance for the study of written language. Realizing the necessity of a large text corpus for Sindhi, we started this research by collecting a raw corpus from multiple web resources using the Scrapy framework (https://github.com/scrapy/scrapy). We extracted news columns from the daily Kawish (http://kawish.asia/Articles1/index.htm) and Awami Awaz (http://www.awamiawaz.com/articles/294/) Sindhi newspapers; Wikipedia dumps (https://dumps.wikimedia.org/sdwiki/20180620/); short stories and sports news from the Wichaar social blog (http://wichaar.com/news/134/, accessed in Dec-2018); news from the Focus WordPress blog (https://thefocus.wordpress.com/, accessed in Dec-2018); historical writings, novels, stories, and books from the Sindh Salamat literary website (http://sindhsalamat.com/, accessed in Jan-2019); novels, history, and religious books from the Sindhi Adabi Board (http://www.sindhiadabiboard.org/catalogue/History/Main_History.HTML); and tweets regarding news and sports from Twitter (https://twitter.com/dailysindhtimes).
2.3 Preprocessing
The preprocessing of a text corpus obtained from multiple web resources is a challenging task, and it becomes even more complicated for a low-resourced language like Sindhi due to the lack of open-source preprocessing tools such as NLTK for English. Therefore, we design the preprocessing pipeline depicted in Figure 1 for the filtration of unwanted data and of vocabulary from other languages such as English, in order to prepare input for word embeddings. The involved preprocessing steps are described in detail below. Moreover, we release a list of Sindhi stop words, the construction of which is labor intensive and requires human judgment. Hence, the most frequent and least important words are classified as stop words with the help of a Sindhi linguistic expert. A partial list of Sindhi stop words is given in Table 4. We use the Python programming language for designing the preprocessing pipeline, using regex and string functions.
Input: The collected text documents were concatenated for the input in UTF-8 format.
Replacement symbols: The punctuation marks of full stop, hyphen, apostrophe, comma, quotation mark, and exclamation mark are replaced with white space for reliable tokenization, because without this replacement words were found joined to their next or previous words.
Filtration of noisy data: Text acquired from web resources contains a huge amount of noisy data. Therefore, we filtered out unimportant data such as the remaining punctuation marks, special characters, HTML tags, all types of numeric entities, and email and web addresses.
Normalization: In this step, we tokenize the corpus and then normalize it to lower-case for the filtration of multiple white spaces, English vocabulary, and duplicate words. Stop words were filtered out only when preparing input for GloVe; the sub-sampling approach in CBoW and SG can discard the most frequent words, including stop words, automatically.
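The steps above can be sketched in Python with the standard `re` module. This is only an illustration under stated assumptions: the paper does not publish its regular expressions, so every pattern name below is hypothetical, and the choice to keep only characters from the Unicode Arabic block (U+0600 to U+06FF) is our own simplification for dropping English vocabulary and special characters. Stop-word and duplicate filtering are not shown.

```python
import re

# Hypothetical patterns; the exact expressions used by the authors are not given.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
# Full stop, hyphen, apostrophe, comma, quotation and exclamation marks,
# plus the Arabic comma, Arabic full stop and Arabic question mark.
SPLIT_PUNCT_RE = re.compile("[.\\-',\"!\u060C\u06D4\u061F]")
# In Python 3, \d also matches Arabic-Indic digits.
DIGIT_RE = re.compile(r"\d+")
# Assumption: anything outside the Arabic block or whitespace is noise.
NON_ARABIC_RE = re.compile(r"[^\u0600-\u06FF\s]+")


def clean_text(raw: str) -> str:
    """Return `raw` with markup, addresses, digits and non-Arabic text removed."""
    text = HTML_TAG_RE.sub(" ", raw)       # HTML tags
    text = URL_RE.sub(" ", text)           # web addresses
    text = EMAIL_RE.sub(" ", text)         # email addresses
    text = SPLIT_PUNCT_RE.sub(" ", text)   # punctuation -> white space
    text = DIGIT_RE.sub(" ", text)         # numeric entities
    text = NON_ARABIC_RE.sub(" ", text)    # English words, other symbols
    return " ".join(text.split())          # collapse multiple white spaces


def tokenize(raw: str) -> list:
    """Whitespace tokenization of the cleaned text."""
    return clean_text(raw).split()
```

Replacing punctuation with a space, rather than deleting it, is what keeps neighbouring words from being joined, which is the problem the "Replacement symbols" step describes.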
2.4 Word embedding models
NN-based approaches have produced state-of-the-art performance in NLP with the usage of robust word embeddings generated from large unlabelled corpora. Therefore, word embeddings have become the main component for setting new benchmarks in NLP using deep learning approaches. Most recently, the use cases of word embeddings are no longer limited to boosting statistical NLP applications; they can also be used to develop language resources, such as the automatic construction of WordNet using an unsupervised approach.
A word embedding can be defined as a mapping that encodes each word $w$ of a vocabulary $V$ as a vector $\vec{w}$ in a $D$-dimensional embedding space. Embedding methods can be broadly categorized into predictive and count-based methods, generated by employing co-occurrence statistics, NN algorithms, and probabilistic models. The GloVe algorithm treats each word as a single entity in the corpus and generates one vector per word. In contrast, CBoW and SG, well known as word2vec and later extended, rely on a simple two-layer NN architecture that uses a linear activation function in the hidden layer and softmax in the output layer. The extended word2vec model with sub-word information treats each word as a bag of character n-grams.
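2.5 GloVe
GloVe is a log-bilinear regression model that combines two methods, local context windows and global matrix factorization, for training word embeddings of a given vocabulary in an unsupervised way. It weights the contexts using the harmonic function; for example, a context word four tokens away from an occurrence will be counted as $\frac{1}{4}$. GloVe's implementation represents a word $w$ and a context $c$ as $D$-dimensional vectors $\vec{w}$ and $\vec{c}$ in the following way,
$$\vec{w} \cdot \vec{c} + b_w + b_c = \log \#(w, c) \quad \forall (w, c)$$
where $\vec{w}$ is a row vector, $\vec{c}$ is a column vector, $b_w$ and $b_c$ are the word and context bias terms, and $\#(w, c)$ is the number of times $w$ and $c$ co-occur.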
2.6 Continuous bag-of-words
The standard CBoW is the inverse of the SG model: it predicts the target word from its context. The length of the input in the CBoW model depends on the context window size, which determines the distance to the left and right of the target word. Hence the context is a window that contains the neighboring words. Given a sequence of words $w_1, w_2, \dots, w_T$, the objective of CBoW is to maximize the probability of each word given its neighboring words,
$$\frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid c_t)$$
where $c_t$ is the context of the $t$-th word, for example $w_{t-k}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+k}$ with a window of size $2k$.
2.7 Skip gram
The SG model predicts the surrounding words given an input word, with the training objective of learning good word embeddings that efficiently predict the neighboring words. The goal of skip-gram is to maximize the average log-probability of context words across the entire training corpus,
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{j \in c_t} \log p(w_j \mid w_t)$$
where $c_t$ denotes the context, that is, the set of indices of the words near $w_t$ in the training corpus.
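To make the difference between the two objectives concrete, the following minimal sketch enumerates the training examples each model would see for a tokenized sentence. The function names and the symmetric fixed window are illustrative assumptions, not the word2vec implementation itself, which additionally applies sub-sampling and dynamic window sizes.

```python
def context_indices(t, length, window):
    """Indices of the words within `window` positions of position t."""
    lo = max(0, t - window)
    hi = min(length, t + window + 1)
    return [j for j in range(lo, hi) if j != t]


def skipgram_pairs(tokens, window):
    """SG: one (input word, context word) pair per neighbour."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in context_indices(t, len(tokens), window):
            pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBoW: one (context words, target word) example per position."""
    examples = []
    for t, target in enumerate(tokens):
        context = [tokens[j] for j in context_indices(t, len(tokens), window)]
        examples.append((context, target))
    return examples
```

SG therefore produces several examples per position, one for each neighbour, while CBoW produces a single example whose input is the whole window.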
2.8 Hyperparameters
2.8.1 Sub-sampling
The sub-sampling approach is useful for diluting the most frequent words or stop words; it also accelerates learning and increases accuracy when learning rare word vectors. Numerous words in English, e.g., ‘the’, ‘you’, ’that’, carry little importance but appear very frequently in text. Treating all words equally would lead to over-fitting of the model parameters on the frequent word embeddings and under-fitting on the rest. Therefore, it is useful to counter the imbalance between rare and repeated words. The sub-sampling technique randomly removes the most frequent words according to a threshold $t$ and the frequency $f$ of each word in the corpus,
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$
where each word $w_i$ is discarded with the computed probability $P(w_i)$ in the training phase, $f(w_i)$ is the frequency of word $w_i$, and $t > 0$ is the threshold parameter.
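As a hedged illustration, the sketch below computes a discard probability per word using the commonly cited word2vec form $P(w) = 1 - \sqrt{t / f(w)}$, with $f(w)$ taken as the relative frequency of $w$. The function name, the default threshold `t=1e-5`, and the clipping at zero are our assumptions; released word2vec and fastText code uses slightly different variants of this formula.

```python
import math
from collections import Counter


def discard_probabilities(tokens, t=1e-5):
    """Map each word to its probability of being dropped during training."""
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = {}
    for word, count in counts.items():
        f = count / total                       # relative frequency f(w)
        # Words with f(w) <= t are never discarded, hence the clip at 0.
        probs[word] = max(0.0, 1.0 - math.sqrt(t / f))
    return probs
```

With a large threshold such as `t=0.25` on a toy corpus, a word covering 90% of the tokens is dropped a little under half of the time, while a word covering 10% is always kept.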
2.8.2 Dynamic context window
Traditional word embedding models usually use a context window of fixed size. For instance, if the window size ws=6, then a word six tokens away from the target is treated the same as the adjacent word. A dynamic weighting scheme instead assigns more weight to closer words, as closer words are generally considered to be more important to the meaning of the target word. The CBoW, SG, and GloVe models employ this weighting scheme. The GloVe model weights the contexts using a harmonic function; for example, a context word four tokens away from an occurrence will be counted as $\frac{1}{4}$. The CBoW and SG implementations instead weigh each context by its window-relative distance from the target word divided by ws; e.g., ws=6 will weigh its contexts by $\frac{6}{6}, \frac{5}{6}, \frac{4}{6}, \frac{3}{6}, \frac{2}{6}, \frac{1}{6}$.
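The two weighting schemes can be compared with a small sketch. The function names are hypothetical, and the word2vec weight is written here as $(ws - d + 1)/ws$ for a context word at distance $d$, which is the expected effect of its randomly sampled window size rather than an explicit multiplication in the training code.

```python
from fractions import Fraction


def glove_weight(distance):
    """Harmonic weighting: a word `distance` tokens away counts as 1/distance."""
    return Fraction(1, distance)


def word2vec_weight(distance, ws):
    """Effective weight of a context word at `distance` for window size ws."""
    if not 1 <= distance <= ws:
        raise ValueError("distance must lie inside the window")
    return Fraction(ws - distance + 1, ws)


def window_weights(ws):
    """Weights for distances 1..ws under both schemes."""
    return [(d, glove_weight(d), word2vec_weight(d, ws)) for d in range(1, ws + 1)]
```

Both schemes give the adjacent word full weight, but the harmonic function decays faster: at distance 2 GloVe already assigns 1/2, whereas a window of ws=6 in word2vec still assigns 5/6.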
2.8.3 Sub-word model
The sub-word model can learn the internal structure of words by sharing character n-gram representations across words. In that way, the vector for each word is made of the sum of its character n-gram vectors. For example, the vector of the word “table” is a sum of n-gram vectors: by setting the minimum and maximum n-gram lengths, we obtain all sub-words of ”table” whose length lies between the two. The < and > symbols are used to mark word boundaries, which separates prefixes and suffixes from other character sequences. In this way, the sub-word model utilizes the principles of morphology, which improves the quality of infrequent word representations. In addition to the character n-grams, the input word itself is also included in the set of character n-grams, so that a representation of each whole word is learned. Suppose we are given a dictionary of n-grams of size $G$ and a word $w$ with n-gram set $G_w \subset \{1, \dots, G\}$. A vector representation $z_g$ is associated with each n-gram $g$. Hence, each word is represented by the sum of its character n-gram representations, and the scoring function $s$ for a word $w$ and a context word $c$ with vector $v_c$ is
$$s(w, c) = \sum_{g \in G_w} z_g^{\top} v_c$$
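The n-gram extraction described above can be sketched as follows. The function name and the `minn`/`maxn` parameter names (borrowed from the fastText command line) are illustrative, and the lengths used in the usage note are example values only, not the settings chosen in this work.

```python
def char_ngrams(word, minn, maxn):
    """Character n-grams of `word` with boundary markers, lengths minn..maxn."""
    wrapped = f"<{word}>"                  # '<' and '>' mark the word boundaries
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def subword_units(word, minn, maxn):
    """N-grams plus the whole wrapped word, which gets its own vector."""
    units = char_ngrams(word, minn, maxn)
    whole = f"<{word}>"
    if whole not in units:
        units.append(whole)
    return units
```

For instance, `char_ngrams("table", 3, 3)` yields `<ta`, `tab`, `abl`, `ble`, and `le>`; the boundary markers let the model distinguish the prefix `<ta` from the same letters occurring inside another word.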
2.8.4 Position-dependent weights
The position-dependent weighting approach is used to avoid the direct encoding of representations for words and their positions, which can lead to over-fitting. The approach learns positional representations and uses them to reweight the word embeddings in the context. Thus, it captures good contextual representations at lower computational cost,
$$v_C = \sum_{p \in P} d_p \odot u_{t+p}$$
where $p$ is an individual position in the context window, associated with the positional vector $d_p$, $u_{t+p}$ is the vector of the context word at that position, and $\odot$ denotes element-wise multiplication. The context vector is thus the average of the context words reweighted by their positional vectors. $P$ is the set of relative positions in the context window and $v_C$ is the context vector of $w_t$.
2.8.5 Shifted point-wise mutual information
The use of a sparse Shifted Positive Point-wise Mutual Information (SPPMI) word-context matrix in learning word representations improves results on two word similarity tasks. CBoW and SG have a hyperparameter $k$ (the number of negatives), which affects the value that both models try to optimize for each $(w, c)$: $PMI(w, c) - \log k$. The parameter $k$ has two functions: it gives a better estimation of negative examples, and it acts as a prior on the probability of observing positive examples (actual occurrences of $(w, c)$).
2.8.6 Deleting rare words
Before creating a context window, the automatic deletion of rare words also leads to performance gain in CBoW, SG and GloVe models, which further increases the actual size of context windows.
2.9 Evaluation methods
The intrinsic evaluation is based on semantic similarity in word embeddings. The word similarity approach states that words are similar if they appear in similar contexts. We measure the word similarity of the proposed Sindhi word embeddings using the dot product method and WordSim353.
2.9.1 Cosine similarity
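The cosine similarity between two non-zero vectors is a popular measure that calculates the cosine of the angle between them, and it can be derived using the Euclidean dot product. The dot product multiplies the corresponding components of both vectors and adds the products together. The result of a dot product between two vectors is not another vector but a single value, a scalar. For two vectors $\vec{a} = (a_1, a_2, \dots, a_n)$ and $\vec{b} = (b_1, b_2, \dots, b_n)$, where $a_i$ and $b_i$ are the components of the vectors and $n$ is their dimension, the dot product is defined as
$$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$$
The cosine of the angle $\theta$ between two non-zero vectors can be derived from the Euclidean dot product formula,
$$\vec{a} \cdot \vec{b} = \lVert \vec{a} \rVert \, \lVert \vec{b} \rVert \cos(\theta)$$
Given two vectors of attributes $\vec{a}$ and $\vec{b}$, the cosine similarity, $\cos(\theta)$, is represented using the dot product and magnitudes as
$$\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$
where $a_i$ and $b_i$ are the components of vectors $\vec{a}$ and $\vec{b}$, respectively.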
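A minimal sketch of this measure in plain Python is given below, together with a nearest-neighbour query of the kind used in intrinsic evaluation. The function names and the dictionary-of-lists representation of the embeddings are assumptions made for illustration; a practical implementation would use vectorized array operations.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, embeddings, k):
    """The k words most similar to `query` in a {word: vector} mapping."""
    scores = [
        (word, cosine_similarity(embeddings[query], vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]
```

Because the measure normalizes by both magnitudes, it depends only on the direction of the vectors, so a frequent word with a large norm does not dominate the neighbour list.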
2.9.2 WordSim353
WordSim353 is popular for the evaluation of lexical similarity and relatedness. The similarity scores were assigned by 13 to 16 human subjects with semantic relations for 353 English noun pairs. Due to the lack of annotated datasets in the Sindhi language, we translated WordSim353 using an English-to-Sindhi bilingual dictionary (http://dic.sindhila.edu.pk/index.php?txtsrch=) for the evaluation of our proposed Sindhi word embeddings and SdfastText. We use the Spearman correlation coefficient for the semantic and syntactic similarity comparison; it measures the strength of a monotonic relationship, linear or nonlinear, when there are no repeated data values. A perfect Spearman's correlation of $+1$ or $-1$ occurs when the observations in the two sets of data (word pairs) are monotonically increasing or decreasing functions of each other, and the coefficient is computed as
$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
where $r_s$ is the rank correlation coefficient, $n$ denotes the number of observations, and $d_i$ is the rank difference between the $i$-th pair of observations.
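The rank-difference form of Spearman's coefficient, $1 - 6\sum d_i^2 / (n(n^2-1))$, can be sketched as below. This simple version assumes there are no tied values, the same condition noted above; with ties, average ranks and the Pearson correlation of the ranks would be needed. The function names are illustrative.

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

In a WordSim353-style evaluation, `x` would hold the human similarity scores of the word pairs and `y` the cosine similarities produced by the embeddings for the same pairs.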
3 Statistical analysis of corpus
The large corpus acquired from multiple resources is rich in vocabulary. We present the complete statistics of collected corpus (see Table 2) with number of sentences, words and unique tokens.
Table 2: Statistics of the collected corpus.
|Source|Category|Sentences|Words|Unique tokens|
|Awami awaz|News columns|107,326|7,487,319|65,632|
|Social Blogs|Stories, sports|7,018|254,327|10,615|
|Focus word press|Short Stories|63,251|968,639|28,341|
|Sindhi Adabi Board|History books|478,424|9,757,844|57,854|
3.1 Letter occurrences
The frequency of letter occurrences in human language is not arbitrary but follows specific rules, which enables us to describe some linguistic regularities. Zipf's law suggests that if the letters or words are ranked in descending order of frequency, then
$$F_r = \frac{a}{r^b}$$
where $F_r$ is the frequency of the letter of rank $r$, and $a$ and $b$ are parameters of the input text. The relative letter frequency in the corpus is the total number of occurrences of a letter divided by the total number of letters present in the corpus. The letter frequencies in our developed corpus are depicted in Figure 2; the corpus contains a total of 187,620,276 characters. The Sindhi Persian-Arabic alphabet consists of 52 letters, but 59 letters are detected in the vocabulary; the additional seven are modified uni-grams and standalone honorific symbols.
3.2 Letter n-grams frequency
We denote the combination of letter occurrences in a word as n-grams, where each letter is a gram in a word. The letter n-gram frequency is carefully analyzed in order to find the length of words, which is essential for developing NLP systems, including the learning of word embeddings, for example when choosing the minimum or maximum sub-word length for character-level representation learning. We calculate the letter n-grams in words along with their percentage in the developed corpus (see Table 3). Bi-gram words are the most frequent and mostly consist of stop words, followed by 4-gram words.
|n-grams|Frequency|% in corpus|
3.3 Word Frequencies
The word frequency count is an observation of word occurrences in the text. Commonly used words have a higher frequency, such as the word “the” in English; similarly, rarely used words have a lower frequency. Such frequencies can be calculated at the character or word level. We calculate word frequencies by counting the occurrences of a word $w$ in the corpus $c$,
$$freq(w) = \sum_{k \in c} \mathbb{1}[k = w]$$
where the frequency of $w$ is the sum over every token $k$ in $c$ that is an occurrence of $w$.
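The counts used in this section can be sketched with `collections.Counter`. The function names are hypothetical and the input is assumed to be an already cleaned, whitespace-tokenized corpus; word length is measured in characters, matching the letter n-gram view of words used above.

```python
from collections import Counter


def word_frequencies(tokens):
    """Number of occurrences of every word in the token list."""
    return Counter(tokens)


def relative_letter_frequencies(tokens):
    """Occurrences of each letter divided by the total number of letters."""
    letters = Counter(ch for token in tokens for ch in token)
    total = sum(letters.values())
    return {ch: count / total for ch, count in letters.items()}


def word_length_distribution(tokens):
    """Percentage of tokens having each word length (letter n-gram size)."""
    lengths = Counter(len(token) for token in tokens)
    total = sum(lengths.values())
    return {n: 100.0 * count / total for n, count in lengths.items()}
```

Calling `word_frequencies(tokens).most_common(n)` gives the most frequent words, which is the kind of candidate list a linguistic expert would then review when identifying stop words.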
3.4 Stop words
The most frequent and least important words in NLP are often classified as stop words. The removal of such words can boost the performance of NLP models in tasks such as sentiment analysis and text classification. But the construction of such a word list is time consuming and requires human decisions. First, we determined candidate Sindhi stop words by counting their term frequencies using Eq. 12, and second, we analysed their grammatical status with the help of a Sindhi linguistic expert, because not all frequent words are stop words (see Figure 3). After determining the importance of such words with the help of human judgment, we placed them in the list of stop words. The total number of detected stop words in our developed corpus is 340. A partial list of the most frequent Sindhi stop words is given in Table 4 along with their frequencies. The filtration of stop words is an essential preprocessing step for learning GloVe word embeddings; therefore, we filtered out stop words when preparing input for the GloVe model. In the CBoW and SG models, however, the sub-sampling approach is used to discard such highly frequent words.
4 Experiments and results
Hyperparameter optimization  is as important as designing a novel algorithm. We carefully optimize the dictionary-based and algorithm-based parameters of the CBoW, SG, and GloVe algorithms. To that end, we conducted a large number of training and evaluation experiments until we arrived at the most suitable hyperparameters, which are listed in Table 5 and discussed in Section 4.1. The choice of optimized hyperparameters is based on the high cosine similarity score in retrieving nearest neighboring words, the semantic and syntactic similarity between word pairs, WordSim353, and the visualization of the distance between the twenty nearest neighbours using t-SNE. All the experiments are conducted on a GTX 1080-TITAN GPU.
4.1 Hyperparameter optimization
The state-of-the-art SG, CBoW     and GloVe  word embedding algorithms are evaluated by parameter tuning for the development of Sindhi word embeddings. These parameters can be categorized as dictionary-based and algorithm-based. The integration of character n-grams in learning word representations is well suited to morphologically rich languages because this approach can compute representations for rare and misspelled words. Sindhi is also a morphologically rich language. Therefore, more robust embeddings can be trained through hyperparameter optimization of the SG, CBoW, and GloVe algorithms. We tuned and evaluated the hyperparameters of the three algorithms individually, as discussed below:
Number of Epochs: Generally, more epochs on the corpus often produce better results but more epochs take long training time. Therefore, we evaluate , , and epochs for each word embedding model, and epochs constantly produce good results.
Learning rate (lr): We tried lr of , , and , the optimal lr gives the better results for training all the embedding models.
Dimensions (): We evaluate and compare the quality of , , and using WordSim353 on different , and the optimal are evaluated with cosine similarity matrix for querying nearest neighboring words and calculating the similarity between word pairs. The embedding dimensions have little effect on the results of the intrinsic evaluation. However, the choice of embedding dimensions might have more impact on accuracy in certain downstream NLP applications. Lower-dimensional embeddings are faster to train and evaluate.
Character n-grams: The selection of minimum (minn) and the maximum (maxn) length of character is an important parameter for learning character-level representations of words in CBoW and SG models. Therefore, the n-grams from were tested to analyse the impact on the accuracy of embedding. We optimized the length of character n-grams from and by keeping in view the word frequencies depicted in Table 3.
Window size (ws): A large ws means considering more context words, and a small ws limits the number of context words. By changing the size of the dynamic context window, we tried ws values of 3, 5, and 7; the optimal ws=7 consistently yields better performance.
Negative Sampling (NS): More negative examples yield better results, but they also take longer to train. We tried 10, 20, and 30 negative examples for CBoW and SG. Using 20 negative examples for CBoW and SG yields significantly better performance at an average training time.
Minimum word count (minw): We evaluated minimum word counts from 1 to 8 and observed that the input vocabulary shrinks sharply as more words are ignored and, conversely, grows when rare words are included. Ignoring words with a frequency of less than 4 in CBoW, SG, and GloVe consistently yields better results, with a vocabulary of 200,000 words.
Loss function (ls): We use hierarchical softmax (hs) for CBoW, negative sampling (ns) for SG, and the default loss function for GloVe .
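Two of the dictionary-level parameters above, the character n-gram range (minn, maxn) and the minimum word count (minw), can be illustrated with a short sketch. The boundary markers `<` and `>` follow the convention described for subword models and are an assumption here, as are the helper names; the n-gram length of 3 in the example is illustrative and is not the setting used in the experiments.

```python
def char_ngrams(word, minn, maxn):
    """Character n-grams of length minn..maxn over the word wrapped in < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

def filter_vocab(freq, minw):
    """Keep only words that occur at least minw times."""
    return {word: count for word, count in freq.items() if count >= minw}

# Illustrative values only.
print(char_ngrams("where", minn=3, maxn=3))
print(filter_vocab({"a": 5, "b": 3, "c": 4}, minw=4))
```

A rare or misspelled word still shares many of its n-grams with known words, which is why sub-word models can produce a vector for it.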
5 Word similarity comparison of Word Embeddings
5.1 Nearest neighboring words
The cosine similarity matrix  is a popular approach for measuring how closely the embedding of each vocabulary word relates to that of a query word. Words that occur in similar contexts receive a high cosine similarity, which is geometrically related to Euclidean distance, a common and primary method for measuring the distance between a set of words and their nearest neighbors. For each query word we retrieve the top eight nearest neighboring words, determined by the highest cosine similarity score using Eq. 9. We present the English translations of both the query and the retrieved words, and discuss them by their English meanings, for ease of relevance judgment between the query and retrieved words. To take a closer look at the semantic and syntactic relationships captured in the proposed word embeddings, Table 6 shows the top eight nearest neighboring words of five different query words, Friday, Spring, Cricket, Red, and Scientist, taken from the vocabulary. The first query word, Friday, returns the names of the days Saturday, Sunday, Monday, Tuesday, Wednesday, and Thursday in an unordered sequence. SdfastText returns five names of days: Sunday, Thursday, Monday, Tuesday, and Wednesday. The GloVe model also returns five names of days. However, CBoW and SG return six names of days, with the exception of Wednesday, along with different written forms of the query word Friday in the Sindhi language, which shows that CBoW and SG return more relevant words than SdfastText and GloVe. CBoW returned Add and GloVe returned Honorary, words that are only slightly similar to the query word, but SdfastText returned two irrelevant words: Kameeso (N), which is the name (N) of a person in Sindhi, and Phrase, which is a combination of three Sindhi words that are not tokenized properly.
Similarly, the nearest neighbors of the second query word, Spring, are retrieved accurately by CBoW, SG, and GloVe as names and seasons that are semantically related to the query word, but SdfastText returned four irrelevant words out of eight: Dilbahar (N), Phrase, Ashbahar (N), and Farzana (N). The third query word is Cricket, the name of a popular game. The first retrieved word in CBoW is Kabadi (N), a popular national game in Pakistan. Including Kabadi (N), all the words returned by CBoW, SG, and GloVe are related to the game of cricket or are names of other games. But the first word retrieved by SdfastText, Gone.Cricket, consists of two words joined by a punctuation mark (.), which shows a tokenization error in the preprocessing step; the sixth retrieved word, Misspelled, is a combination of three words not related to the query word; and Played and Being played are also irrelevant and are stop words. Moreover, the fourth query word, Red, gave results that contain names closely related to the query word and different forms of the query word written in the Sindhi language. The last word returned by SdfastText, Unknown, is irrelevant and was not found in the Sindhi dictionary for translation. The last query word, Scientist, also returns semantically related words from CBoW, SG, and GloVe, but the first word given by SdfastText belongs to the Urdu language, which means that its vocabulary may also contain words of other languages. Another unknown word returned by SdfastText does not have any meaning in the Sindhi dictionary. Further interesting observations in the presented results are the diacritized words retrieved from our proposed word embeddings and the authentic tokenization in the preprocessing step presented in Figure 1. In contrast, SdfastText returned tri-gram words (Phrase) for the query words Friday and Spring, and a Misspelled word for the query words Cricket and Scientist.
Hence, the overall performance of our proposed SG, CBoW, and GloVe embeddings demonstrates high semantic relatedness in retrieving the top eight nearest neighbor words.
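Nearest-neighbor retrieval by cosine similarity can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used here: the two-dimensional toy vectors, the English keys, and the function names are hypothetical, and a real run would use the trained embedding matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(query, vectors, k=8):
    """Top-k words ranked by cosine similarity to the query word."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-D embeddings (hypothetical data).
vectors = {
    "friday": [1.0, 0.0],
    "saturday": [0.9, 0.1],
    "sunday": [0.8, 0.3],
    "cricket": [0.0, 1.0],
}
top2 = nearest_neighbors("friday", vectors, k=2)
print(top2)
```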
5.2 Word pair relationship
Generally, closer words are considered more important to a word’s meaning. Word embedding models have the ability to capture the lexical relations between words. Identifying the relationships that connect words is important in NLP applications. We measure that semantic relationship by calculating the dot product of two vectors using Eq. 9. A high cosine similarity score denotes closer words in the embedding matrix, while a lower cosine similarity score means a greater distance between the word pair. We present the cosine similarity scores of different semantically or syntactically related word pairs taken from the vocabulary in Table 7, along with their English translations, which shows average similarities of 0.632, 0.650, and 0.591 for CBoW, SG, and GloVe, respectively. The SG model achieved the highest average similarity score of 0.650, followed by CBoW with an average similarity score of 0.632. GloVe also achieved a considerable average score of 0.591. However, the average similarity score of SdfastText is 0.388, and the word pair Microsoft-Bill Gates is not available in the vocabulary of SdfastText. This shows that, in addition to its lower performance, the vocabulary of SdfastText is also limited compared to our proposed word embeddings.
Moreover, the average semantic relatedness similarity score between countries and their capitals is shown in Table 8 with English translations, where SG again yields the best average score of 0.663, followed by CBoW with a similarity score of 0.611. GloVe yields a semantic relatedness of 0.576, and SdfastText yields an average score of 0.391. The first word pair, China-Beijing, is not available in the vocabulary of SdfastText. However, the similarity score for Afghanistan-Kabul is lower in our proposed CBoW, SG, and GloVe models because the word Kabul is the name of the capital of Afghanistan and also frequently appears in Sindhi text as an adjective meaning able.
5.3 Comparison with WordSim353
We evaluate the performance of our proposed word embeddings using the WordSim353 dataset  by translating the English word pairs to Sindhi. Due to vocabulary differences between English and Sindhi, we were unable to find the authentic meaning of six terms, so we left these terms untranslated. Our final Sindhi WordSim353 therefore consists of 347 word pairs. Table 9 shows the Spearman correlation results, computed using Eq. 10, for embeddings of different dimensions on the translated WordSim353. Table 9 presents complete results with the different ws values for CBoW, SG, and GloVe, in which ws=7 yields better performance than ws of 3 and 5. The SG model outperforms CBoW and GloVe in semantic and syntactic similarity, achieving a score of 0.629 with ws=7. In comparison, the reported results for English  are average semantic and syntactic similarities of 0.637 and 0.656 with CBoW and SG, respectively. Therefore, despite the challenges in translation from English to Sindhi, our proposed Sindhi word embeddings have efficiently captured the semantic and syntactic relationships.
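The Spearman correlation between human similarity ratings and model cosine scores can be sketched as the Pearson correlation of the two rank vectors, which coincides with the familiar rank-difference formula when there are no ties. This is a minimal sketch under stated assumptions: the function names are hypothetical, tied values receive average ranks, and the scores below are made-up toy values rather than WordSim353 data.

```python
import math

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation between two equally long score lists."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy scores for four word pairs (hypothetical data).
human = [9.0, 7.5, 3.0, 1.0]
model = [0.8, 0.6, 0.7, 0.1]
rho = spearman(human, model)
print(round(rho, 3))
```

Word pairs that are missing from a model's vocabulary have to be dropped or handled explicitly before calling `spearman`, since both lists must align pair by pair.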
We use the t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality  reduction algorithm with PCA  for exploratory analysis of the embeddings in a 2-dimensional map. t-SNE is a non-linear dimensionality reduction algorithm for the visualization of high-dimensional datasets. It first computes probabilities that represent the similarity of points in the high-dimensional space and then computes the probabilities of similar points in the corresponding low-dimensional space. The purpose of t-SNE for the visualization of word embeddings is to keep similar words close together in 2-dimensional coordinate pairs while maximizing the distance between dissimilar words. t-SNE has a tunable perplexity (PPL) parameter used to balance the data points at both the local and global levels. We visualize the embeddings using PPL=20 and 5000 iterations on the 300-D models. We use the same query words (see Table 6) and retrieve the top 20 nearest neighboring word clusters for a better understanding of the distance between similar words. Every query word has a distinct color for clear visualization of each group of similar words. Closer word clusters show high similarity between the query and the retrieved word clusters. The word clusters in SG (see Fig. 5) are closer to their groups of semantically related words. The CBoW model (Fig. 4) and GloVe (Fig. 6) also show better cluster formation of words than SdfastText (Fig. 7).
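For reference, the two probability distributions and the perplexity mentioned above can be written in the standard t-SNE formulation. The notation is introduced here for illustration and does not come from this paper: $x_i$ are the high-dimensional embeddings, $y_i$ their 2-dimensional images, $\sigma_i$ the per-point bandwidth chosen to match the target perplexity, and $N$ the number of points.

```latex
\begin{align*}
p_{j|i} &= \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
&
p_{ij} &= \frac{p_{j|i} + p_{i|j}}{2N} \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}},
&
C &= \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \\
\mathrm{Perp}(P_i) &= 2^{H(P_i)},
&
H(P_i) &= -\sum_{j} p_{j|i} \log_2 p_{j|i}
\end{align*}
```

Minimizing $C$ pulls points with high $p_{ij}$ together in the map, and the perplexity acts as a smooth measure of the effective number of neighbors considered for each point.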
6 Discussion and future work
In this information age, the existence of language resources (LRs) plays a vital role in the digital survival of natural languages, because NLP tools are used to process a flow of unstructured data from disparate sources. Presently, Sindhi Persian-Arabic is frequently used in online communication, newspapers, and public institutions in Pakistan and India. Due to the growing use of Sindhi on web platforms, the need for its LRs is also increasing for the development of language technology tools. But little work has been carried out on resource development, and it is not sufficient for designing language-independent or machine learning algorithms. The present work is a first comprehensive initiative on resource development, along with its evaluation, for statistical Sindhi language processing. More recently, NN-based approaches have produced state-of-the-art performance in NLP by exploiting unsupervised word embeddings learned from large unlabelled corpora. Such word embeddings have also motivated work on low-resourced languages. Our work mainly consists of novel contributions of resource development along with a comprehensive evaluation for the utilization of NN-based approaches in Sindhi NLP (SNLP) applications. The large corpus obtained from multiple web resources is utilized for training word embeddings using the SG, CBoW, and GloVe models. The intrinsic evaluation, along with the comparative results, demonstrates that the proposed Sindhi word embeddings capture semantic information more accurately than the recently released SdfastText word vectors. SG yields the best results in nearest neighbors, word pair relationship, and semantic similarity. The performance of CBoW is also close to that of SG on all the evaluation metrics. GloVe also yields good word representations; however, the SG and CBoW models surpass the GloVe model on all evaluation metrics. Hyperparameter optimization is as important as designing a new algorithm.
The choice of optimal parameters is a key aspect of performance gain in learning robust word embeddings. Moreover, we found that the size of the corpus and careful preprocessing steps have a large impact on the quality of word embeddings. From an algorithmic perspective, the character-level learning approach in SG and CBoW improves the quality of representation learning, and overall, window size, learning rate, and number of epochs are the core parameters that largely influence the performance of word embedding models. Ultimately, the new corpus of the low-resourced Sindhi language, the list of stop words, and the pretrained word embeddings, along with the empirical evaluation, will be a good supplement for future research in statistical Sindhi language processing (SSLP) applications. In the future, we aim to use the corpus for annotation projects such as part-of-speech tagging and named entity recognition. The proposed word embeddings will be refined further by creating custom benchmarks, and the extrinsic evaluation approach will be employed for the performance analysis of the proposed word embeddings. Moreover, we will also utilize the corpus with Bidirectional Encoder Representations from Transformers (BERT)  for learning deep contextualized Sindhi word representations. Furthermore, the generated word embeddings will be utilized for the automatic construction of a Sindhi WordNet.
7 Conclusion
In this paper, we mainly present three novel contributions. Firstly, we develop a large corpus with a large vocabulary of more than 61 million tokens and 908,456 unique words. Secondly, the list of Sindhi stop words is constructed by finding words of high frequency and least importance with the help of a Sindhi linguistic expert. Thirdly, unsupervised Sindhi word embeddings are generated using the state-of-the-art CBoW, SG, and GloVe algorithms and evaluated using the popular intrinsic evaluation approaches of the cosine similarity matrix and WordSim353 for the first time in Sindhi language processing. We translate the English WordSim353 using an English-Sindhi bilingual dictionary, which will also be a good resource for the evaluation of Sindhi word embeddings. Moreover, the proposed word embeddings are also compared with the recently released SdfastText word representations.
Our empirical results demonstrate that our proposed Sindhi word embeddings capture high semantic relatedness in nearest neighboring words, word pair relationships, country-capital pairs, and WordSim353. SG yields the best performance, followed by the CBoW and GloVe models. The performance of GloVe is lower on the same vocabulary because SG and CBoW benefit from character-level learning of word representations and from sub-sampling. Our proposed Sindhi word embeddings surpass SdfastText on the intrinsic evaluation metrics. Also, the vocabulary of SdfastText is limited because it is trained on a small Wikipedia corpus of Sindhi Persian-Arabic. We will further investigate the extrinsic performance of the proposed word embeddings on the Sindhi text classification task in the future. The proposed resources, along with the systematic evaluation, will be a sophisticated addition to the computational resources for statistical Sindhi language processing.
References
-  Jennifer Cole. Sindhi. Encyclopedia of Language & Linguistics, volume 8, 2006.
-  Raveesh Motlani. Developing language technology tools and resources for a resource-poor language: Sindhi. In Proceedings of the NAACL Student Research Workshop, pages 51–58, 2016.
-  Hidayatullah Shaikh, Javed Ahmed Mahar, and Mumtaz Hussain Mahar. Instant diacritics restoration system for sindhi accent prediction using n-gram and memory-based learning approaches. International Journal of Advanced Computer Science and Applications, 8(4):149–157, 2017.
-  Abdul-Majid Bhurgri. Enabling pakistani languages through unicode. Microsoft Corporation white paper at http://download.microsoft.com/download/1/4/2/142aef9f-1a74-4a24-b1f4-782d48d41a6d/PakLang.pdf, 2006.
-  Wazir Ali Jamro. Sindhi language processing: A survey. In 2017 International Conference on Innovations in Electrical Engineering and Computational Technologies (ICIEECT), pages 1–8. IEEE, 2017.
-  Edward Loper and Steven Bird. Nltk: the natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics-Volume 1, pages 63–70. Association for Computational Linguistics, 2002.
-  Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60, 2014.
-  Wanxiang Che, Zhenghua Li, and Ting Liu. Ltp: A chinese language technology platform. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, pages 13–16. Association for Computational Linguistics, 2010.
-  Martin Popel and Zdeněk Žabokrtskỳ. Tectomt: modular nlp framework. In International Conference on Natural Language Processing, pages 293–304. Springer, 2010.
-  Lluís Padró, Miquel Collado, Samuel Reese, Marina Lloberes, and Irene Castellón. Freeling 2.1: Five years of open-source language processing tools. In 7th International Conference on Language Resources and Evaluation, 2010.
-  Waqar Ali Narejo and Javed Ahmed Mahar. Morphology: Sindhi morphological analysis for natural language processing applications. In 2016 International Conference on Computing, Electronic and Electrical Engineering (ICE Cube), 2016.
-  Yang Li and Tao Yang. Word embedding for understanding natural language: a survey. In Guide to Big Data Applications, pages 83–104. Springer, 2018.
-  Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
-  Mutee U Rahman. Towards sindhi corpus construction. In Conference on Language and Technology, Lahore, Pakistan, 2010.
-  Fida Hussain Khoso, Mashooque Ahmed Memon, Haque Nawaz, and Sayed Hyder Abbas Musavi. To build corpus of sindhi. 2019.
-  Mazhar Ali Dootio and Asim Imdad Wagan. Unicode-8 based linguistics data set of annotated sindhi text. Data in brief, 19:1504–1514, 2018.
-  Mazhar Ali Dootio and Asim Imdad Wagan. Development of sindhi text corpus. Journal of King Saud University-Computer and Information Sciences, 2019.
-  Mazhar Ali and Asim Imdad Wagan. Sentiment summerization and analysis of sindhi text. Int. J. Adv. Comput. Sci. Appl, 8(10):296–300, 2017.
-  Kevin Lund and Curt Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior research methods, instruments, & computers, 28(2):203–208, 1996.
-  Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
-  Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
-  Jacob Andreas and Dan Klein. How much do word embeddings encode about syntax? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 822–827, 2014.
-  Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298–307, 2015.
-  Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
-  Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
-  Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
-  Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
-  Neha Nayak, Gabor Angeli, and Christopher D Manning. Evaluating word embeddings using a representative suite of practical tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 19–23, 2016.
-  Bénédicte Pierrejean and Ludovic Tanguy. Towards qualitative word embeddings evaluation: measuring neighbors variation. 2018.
-  Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27. Association for Computational Linguistics, 2009.
-  Roland Schäfer and Felix Bildhauer. Web corpus construction. Synthesis Lectures on Human Language Technologies, 6(4):1–145, 2013.
-  Zeeshan Bhatti, Imdad Ali Ismaili, Waseem Javaid Soomro, and Dil Nawaz Hakro. Word segmentation model for sindhi text. American Journal of Computing Research Repository, 2(1):1–7, 2014.
-  Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
-  Siyu Qiu, Qing Cui, Jiang Bian, Bin Gao, and Tie-Yan Liu. Co-learning of word representations and morpheme representations. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 141–150, 2014.
-  Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015.
-  Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
-  Rémi Lebret and Ronan Collobert. Word emdeddings through hellinger pca. arXiv preprint arXiv:1312.5542, 2013.
-  Amaresh Kumar Pandey and Tanvver J Siddiqui. Evaluating effect of stemming and stop-word removal on hindi text retrieval. In Proceedings of the First International Conference on Intelligent Human Computer Interaction, pages 316–326. Springer, 2009.
-  Mikhail Khodak, Andrej Risteski, Christiane Fellbaum, and Sanjeev Arora. Automated wordnet construction using word embeddings. In Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 12–23, 2017.
-  Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems, pages 2265–2273, 2013.
-  Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185, 2014.
-  Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. ACM Transactions on information systems, 20(1):116–131, 2002.
-  Alvaro Corral, Gemma Boleda, and Ramon Ferrer-i Cancho. Zipf’s law for word frequencies: Word forms versus lemmas in long texts. PloS one, 10(7):e0129031, 2015.