Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
Sinhala is the native language of the Sinhalese people, who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning Indo-European language tree. However, due to poverty in both linguistic and economic capital, Sinhala remains, from the perspective of Natural Language Processing tools and research, a resource-poor language which has neither the economic drive of its cousin English nor the sheer push of the law of numbers that a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, for various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap with a comprehensive literature survey of the publicly available Sinhala natural language tools and research, so that researchers working in this field can better utilize the contributions of their peers. As such, we shall upload this paper to arXiv and update it periodically to reflect the advances made in the field.
The Sinhala language (which, as englebretson2005santa observe, is in some contexts also referred to as Sinhalese, Singhala, and Singhalese), being the native language of the Sinhalese people [disanayaka1976national, perera1985sinhala, bauer2007linguistics], who make up the largest ethnic group of the island country of Sri Lanka, is reported as the mother tongue of approximately 16 million people [2007Percentage, gair1974literary]. To give a brief linguistic background that aligns the Sinhala language with the baseline of English, it should first be noted that Sinhala belongs to the same Indo-European language tree [Young2015language, kanduboda2011role]. However, unlike English, which is part of the Germanic branch, Sinhala belongs to the Indo-Aryan branch. Further, unlike English, which borrowed the Latin alphabet, Sinhala has its own writing system, which is a descendant of the Indian Brahmi script [fernando1949palaeographical, bandara2012creation, daniels1996world, sirisoma1990brahmi, dias1996lakdiwa, hettiarachchi1990investigation, paranavitana1970inscriptions]. By extension, this makes the Sinhala script a member of the Aramaic family of scripts [salomon1998indian, falk1993schrift]. Thus, by inheritance, the Sinhala writing system is an abugida (alphasyllabary), which is to say that consonant-vowel sequences are written as single units [hettige2011computational]. It should be noted that the modern Sinhala language has loanwords from languages such as Tamil, English, Portuguese, and Dutch due to various historical reasons [gunasekara1986comprehensive]. Regardless of the rich historical array of literature spanning several millennia (starting between to century BCE [ray2003archaeology, herath1994practical]), modern natural language processing tools for the Sinhala language are scarce [de2015Sinhala].
Natural Language Processing (NLP) is a broad area covering all computational processing and analysis of human languages. To achieve this end, NLP systems operate at different levels [Wijeratne2019Natural, liddy2001natural, wimalasuriya2010ontology]. A graphical representation of NLP layers and application domains is shown in Figure 1. On one hand, according to liddy2001natural, these systems can be categorized into the following layers: phonological, morphological, lexical, syntactic, semantic, discourse, and pragmatic. The phonological layer deals with the interpretation of language sounds. As such, it consists mainly of speech-to-text and text-to-speech systems. In cases where one is working with written text of the language rather than speech, it is possible to replace this layer with tools which handle Optical Character Recognition (OCR) and language rendering standards (such as Unicode [unicode1996unicode]). The morphological layer analyses words at their smallest units of meaning. As such, analysis of word lemmas and prefix-suffix-based inflection is handled in this layer. The lexical layer handles individual words. Therefore, tasks such as Part of Speech (PoS) tagging happen here. The next layer, syntactic, operates at the phrase and sentence level, where grammatical structures are utilized to obtain meaning. The semantic layer attempts to derive meaning from the word level up to the sentence level, starting with Named Entity Recognition (NER) at the word level and working its way up by identifying the contexts the entities are set in until arriving at the overall meaning. The discourse layer handles meaning in textual units larger than a sentence. In this, the function of a particular sentence may be contextualized within the document it is set in. Finally, the pragmatic layer handles contexts read into contents without having to be explicitly mentioned [Wijeratne2019Natural, liddy2001natural]. Some forms of anaphora (coreference) resolution [van1992presupposition, lappin1994algorithm, soon2001machine, ng2002improving, mitkov2014anaphora] fall into this layer.
On the other hand, wimalasuriya2010ontology categorize NLP tools and research by utility. They introduce three categories of increasing complexity: Information Retrieval (IR), Information Extraction (IE), and Natural Language Understanding (NLU). Information Retrieval covers applications which search for and retrieve information relevant to a given query. For pure IR, tools and methods up to and including the syntactic layer in the above analysis are used. Information Extraction, on the other hand, extracts structured information. The difference between IR and IE is that IR does not change the structure of the documents in question. Whether they are structured, semi-structured, or unstructured, all IR does is fetch them as they are. In comparison, IE takes semi-structured or unstructured text and puts it in a machine-readable structure. For this, IE utilizes all the layers used by IR as well as the semantic layer. Natural Language Understanding is purely the idea of cognition. Most NLU tasks fall under the AI-hard category and remain unsolved [Wijeratne2019Natural]. However, some NLU tasks such as machine translation are being attempted with varying accuracy (this is, however, not without the criticism of being nothing more than a Chinese room [preston2002views] rather than true NLU). The pragmatic layer of the above analysis belongs to the NLU tasks, while the discourse layer straddles information extraction and natural language understanding [Wijeratne2019Natural].
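The relationship between the two categorizations can be written down as a small lookup. The following is only an illustrative sketch: the names `LAYERS`, `CATEGORY_LAYERS`, and `simplest_category` are hypothetical, and modelling NLU as drawing on every layer is an assumption made for the example rather than a claim from the cited works.

```python
# Layers as listed by liddy2001natural, ordered from lowest to highest.
LAYERS = [
    "phonological", "morphological", "lexical", "syntactic",
    "semantic", "discourse", "pragmatic",
]


def layers_up_to(top):
    """Return every layer from the lowest one up to and including `top`."""
    return LAYERS[: LAYERS.index(top) + 1]


CATEGORY_LAYERS = {
    # Pure IR uses tools up to and including the syntactic layer.
    "IR": layers_up_to("syntactic"),
    # IE uses every layer IR uses, plus the semantic layer.
    "IE": layers_up_to("semantic"),
    # Assumption for this sketch: NLU may draw on every layer.
    "NLU": list(LAYERS),
}


def simplest_category(layer):
    """Return the least complex category whose layer set contains `layer`.

    Simplification: the discourse layer is reported as NLU here, although
    the text describes it as straddling IE and NLU.
    """
    for category in ("IR", "IE", "NLU"):
        if layer in CATEGORY_LAYERS[category]:
            return category
    raise ValueError(f"unknown layer: {layer}")
```

For instance, `simplest_category("lexical")` yields `"IR"`, while `simplest_category("semantic")` yields `"IE"`.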
The objective of this paper is to serve as a comprehensive survey on the state of natural language processing resources for the Sinhala language. The initial structure and content of this survey are heavily influenced by the preliminary surveys carried out by de2015Sinhala and Wijeratne2019Natural. However, our hope is to host this survey at arXiv as a perpetually evolving work which is continuously updated as new research and tools for the Sinhala language are created and made publicly available. Hence, it is our hope that this work will help future researchers who are engaged in Sinhala NLP research to conduct their literature surveys efficiently and comprehensively. For the success of this survey, we shall also consider the Sri Lankan NLP tools repository, lknlp (https://github.com/lknlp/lknlp.github.io). This manuscript is at version . The latest version of the manuscript can be obtained from arXiv (https://arxiv.org/abs/1906.02358) or ResearchGate (http://bit.ly/31AhvvR).
The remainder of this survey is organized as follows: Section 2 introduces some properties and conventions of the Sinhala language which are important for the development and understanding of Sinhala NLP. Section 3 discusses the various tools and research available for Sinhala NLP. In that section we discuss both pure Sinhala NLP tools and research as well as hybrid Sinhala-English work. We also discuss research and tools which contribute to Sinhala NLP either along with or with the help of Tamil, the other official language of Sri Lanka. Section 4 gives a brief introduction to the primary language sources used by the studies discussed in this work. Finally, Section 5 concludes the survey.
2 Properties of the Sinhala Language
Before moving on to discussing Sinhala NLP resources, we shall give a brief introduction to some of the important properties of the Sinhala language which impact the development of Sinhala NLP resources. Sinhala grammar has two forms: written (literary) and spoken. These forms differ from each other in their core grammatical structures [fairbanks1968colloquial, englebretson2005santa, miyagishi2005accusative]. The written form strictly adheres to the SOV (Subject, Object, and Verb) configuration [disanayaka1985say, pallatthara1966sinhala]. Further, in the written form, subject-verb agreement is enforced [kanduboda2013usage] such that, in order to be grammatically correct, the subject and the verb must agree in terms of gender (male/female), number (singular/plural), and person (1st/2nd/3rd). However, in spoken Sinhala, the SOV order can be neglected [liyanage2012computational] and the male singular 3rd person verb can be used for all nouns [kanduboda2013usage]. Sinhala is also a head-final language, where the complements and modifiers appear before their heads [karunatilaka1997sinhala]; in the placement of modifiers such as adjectives, this is similar to English and dissimilar to French. In total, according to Abhayasinghe1998sinhala, there are 25 types of simple sentence structures in Sinhala. Similar to many Indo-Aryan languages, animacy plays a major role in Sinhala grammar in syntactic and semantic roles [jany2006relationship, garland2005morphological, henderson2005between]. Comparative studies done by noguchi1984shinharago and by miyagishi2003comparison, miyagishi2005accusative have found that animacy extends its influence from phrase level to sentence level in Sinhala (e.g., usage of post-positions [disanayaka1985say, chandralal2010sinhala]). On this matter, Table I explains grammatical cases and inflections of animate common nouns while Table II explains grammatical cases and inflections of inanimate common nouns.
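The difference between the written and spoken agreement conventions described above can be sketched as a small check. This is an illustrative sketch only: the `Features` type and the `agrees` function are hypothetical names, and treating a fully agreeing verb as also acceptable in the spoken register is an assumption of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    """Agreement features shared by a subject and a verb form."""
    gender: str   # "male" or "female"
    number: str   # "singular" or "plural"
    person: int   # 1, 2, or 3


# The verb form that spoken Sinhala may use with any subject.
SPOKEN_DEFAULT = Features(gender="male", number="singular", person=3)


def agrees(subject, verb, register="written"):
    """Check subject-verb agreement under the given register."""
    if register == "written":
        # Written Sinhala: gender, number, and person must all match.
        return subject == verb
    if register == "spoken":
        # Spoken Sinhala: the male singular 3rd person verb fits all nouns.
        # Assumption: a fully agreeing verb is accepted as well.
        return verb == SPOKEN_DEFAULT or subject == verb
    raise ValueError(f"unknown register: {register}")
```

Under this sketch, a female plural 3rd person subject paired with the `SPOKEN_DEFAULT` verb form is rejected in the written register but accepted in the spoken one.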
We provide a comparative analysis of parsing the very simple English sentence “I eat a red apple” and its Sinhala, Hindi, and French translations in Fig 2. English and French parsing was done using the Stanford Parser (http://nlp.stanford.edu:8080/parser/). Hindi parsing was done using the IIIT-Hyderabad Parser (http://ltrc.iiit.ac.in/analyzer/) and the study by singh2016syntax.
herath1989sinhalese, herath1990formalization argue that pure Sinhala words did not have suffixes and that suffixation was incorporated into Sinhala after 12th century BC with the influx of Sanskrit words. With this, they declare Sinhala to have the following types of words:
Conjunctions and articles
Demonstratives, Interrogatives, and negatives
Particles and prefixes
They further divide nouns into five groups: material, agentive, common, abstract, and proper. In addition to these, they also introduce compound nouns. We show the noun categorization proposed by herath1989sinhalese in Table III.
herath1990formalization categorize Sinhala suffixes along the attributes of gender, number, definiteness, case, and conjunctive. They further claim that there are 3 types of suffixes: Suf1 adds gender, number, and definiteness; Suf2 adds case; and Suf3 adds conjunctive. Conjunctive is claimed to be equivalent to “too” and “and” in English. We show an extension of the suffix structure proposed by herath1990formalization in Table IV.
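The three suffix types described above can be represented as slots, each of which is allowed to contribute only certain attributes. The sketch below is illustrative: `SLOT_ATTRIBUTES` and `merge_suffix_attributes` are hypothetical names, and the attribute values used in any call (for example "dative") are placeholders rather than data from herath1990formalization.

```python
# Which grammatical attributes each suffix slot may contribute.
SLOT_ATTRIBUTES = {
    "Suf1": ("gender", "number", "definiteness"),
    "Suf2": ("case",),
    "Suf3": ("conjunctive",),
}


def merge_suffix_attributes(suffixes):
    """Combine the attributes supplied by each suffix slot of one word.

    `suffixes` maps a slot name ("Suf1", "Suf2", "Suf3") to a dictionary of
    the attributes that the suffix in that slot adds. A ValueError is raised
    if a slot is unknown or carries an attribute it is not allowed to add.
    """
    merged = {}
    for slot, attributes in suffixes.items():
        if slot not in SLOT_ATTRIBUTES:
            raise ValueError(f"unknown suffix slot: {slot}")
        for name, value in attributes.items():
            if name not in SLOT_ATTRIBUTES[slot]:
                raise ValueError(f"{slot} cannot carry attribute {name!r}")
            merged[name] = value
    return merged
```

A word analysed with a Suf1 and a Suf2 would thus end up with gender, number, definiteness, and case, but no conjunctive attribute.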
3 Sinhala NLP resources
In this section we generally follow the structure shown in Figure 1 for sectioning. However, in addition to that, we also discuss topics such as available corpora, other data sets, dictionaries, and WordNets. We focus on NLP tools and research rather than the mechanics of language script handling [samaranayake1989standard, samaranayake2003introduction, dias2004development, dias2005challenges, weerasinghe2006sinhala, sandeva2009design]. One of the earliest attempts at Sinhala NLP was made by herath1991machine. However, progress on that project has been minimal due to the limitations of their time. The later work by nandasara2009past did not capture many of the advances made up to the time of its publication. Given that it was a decade old by the time the first edition of this survey was compiled, we observe the existence of many new discoveries in Sinhala NLP which have not been taken into account by it. A review of some challenges and opportunities of using Sinhala in computer science was done by nandasara2016bridging. At this point, it is worth noting that the largest number of studies in Sinhala NLP has been on optical character recognition (OCR) rather than on higher levels of the hierarchy shown in Figure 1. On the other hand, the most prolific single project of Sinhala NLP we have observed so far is an attempt to create an end-to-end Sinhala-to-English translator [hettige2006morphological, hettige2006parser, hettige2006first, hettige2007developing, hettige2007transliteration, hettige2007using, hettige2008web, hettige2008web1, hettige2009theoretical, hettige2010evaluation, hettige2010varanageema, hettige2011computational, hettige2013selected, hettige2012multi, hettige2013masmt, hettige2014sinhala, hettige2016multi, hettige2017phrase]. Tamil, the other official language of Sri Lanka, is also a resource-poor language. However, due to the existence of larger populations of Tamil speakers worldwide, including but not limited to economic powerhouses such as India, there are more research and tools available for Tamil NLP tasks [Wijeratne2019Natural]. Therefore, it is reasonable to expect that Sinhala and Tamil NLP endeavours can help each other. In particular, the fact that both are official languages of Sri Lanka results in the generation of parallel data sets in the form of official government documents and local news items. A number of researchers make use of this opportunity. We shall be discussing those applications in this paper as well. Further, there have been some fringe implementations which bridge Sinhala with other languages such as Japanese [herath1994practical, herath1993generation, herath1996bunsetsu, thelijjagoda2004japanese, kanduboda2011role].
3.1 Corpora
For any language, the key to NLP applications and implementations is the existence of adequate corpora. On this matter, a relatively substantial Sinhala text corpus (https://osf.io/a5quv/) was created by upeksha2015implementing, upeksha2015comparison by web crawling. Later, a smaller Sinhala news corpus (https://osf.io/tdb84/) was created by de2015Sinhala. Both of the above corpora are publicly available. However, neither of these comes close to the massive capacity and range of the existing English corpora. A word corpus of approximately 35,000 entries was developed by weerasinghe2009corpus, but it does not seem to be online anymore. A number of Sinhala-English parallel corpora were introduced by guzman2019two. These include 600k+ Sinhala-English subtitle pairs (http://bit.ly/2KsFQxm) initially collected by lison2016opensubtitles2016, and 45k+ Sinhala-English sentence pairs from GNOME (http://bit.ly/2Z8q0fo), KDE (http://bit.ly/2WLY6bI), and Ubuntu (http://bit.ly/2wLVZGt). guzman2019two further provided two monolingual corpora for Sinhala: 155k+ sentences of filtered Sinhala Wikipedia (http://bit.ly/2EQZ7oM) and 5178k+ sentences of Sinhala common crawl (http://bit.ly/2ZaQFZo). wijeratne2020sinhala have publicly released (https://bit.ly/2GEI4d6) a massive corpus of text and stop words taken from a decade of Sinhala Facebook posts.
As for Sinhala-Tamil corpora, hameed2016automatic claim to have built a sentence-aligned Sinhala-Tamil parallel corpus and mohamed2017automatic claim to have built a word-aligned Sinhala-Tamil parallel corpus. However, at the time of writing this paper, neither of them was publicly available. A very small Sinhala-Tamil aligned parallel corpus created by farhath2018improving using order papers of the government of Sri Lanka is available to download (http://bit.ly/2HTMEme).
3.2 Data Sets
Specific data sets for Sinhala, as expected, are scarce. However, a Sinhala PoS tagged data set [fernando2016comprehensive, dilshani2017comprehensive, fernando2018evaluation] is available to download from GitHub (http://bit.ly/2Krhrbv). Further, a Sinhala NER data set created by manamini2016ananya is also available to download from GitHub (http://bit.ly/2XrwCoK).
Facebook has released FastText [joulin2016fasttext, bojanowski2017enriching, joulin2017bag] models for the Sinhala language trained using the Wikipedia corpus. They are available as both text models (http://bit.ly/2JXAyL8) and binary files (http://bit.ly/2JY5J9c). Using the above models by Facebook, a group at the University of Moratuwa has created an extended FastText model trained on Wikipedia, news, and official government documents. The binary file (http://bit.ly/2WowH0h) of the trained model is available for download. herathresearch has compiled a report on the Sinhala lexicon for the purpose of establishing a basis for NLP applications. A comparative analysis of Sinhala word embeddings has been conducted by lakmal2020word.
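As a rough illustration of how the text-format models mentioned above might be consumed, the sketch below reads vectors in the plain-text FastText layout (a header line with the vocabulary size and dimension, followed by one word and its values per line) and compares two words by cosine similarity. The function names `load_vec` and `cosine` are hypothetical, and the in-memory `SAMPLE` uses placeholder tokens instead of real Sinhala vectors; a downloaded file would be passed in as a handle opened with UTF-8 encoding.

```python
import io
import math


def load_vec(handle, limit=None):
    """Read word vectors from a text-format embedding file handle.

    Returns (dimension, {word: [float, ...]}). Lines whose number of
    fields does not match the declared dimension are skipped.
    """
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the declared vocabulary size
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(value) for value in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return dim, vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Placeholder data in the same layout; not taken from any released model.
SAMPLE = "3 2\nw1 1.0 0.0 \nw2 0.0 1.0 \nw3 1.0 1.0 \n"
DIM, VECTORS = load_vec(io.StringIO(SAMPLE))
```

The `limit` parameter is there because full embedding files can be large, and reading only the first entries is often enough for quick experiments.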
3.3 Dictionaries
A necessary component for bridging Sinhala and English resources is an English-Sinhala dictionary. The earliest and most extensive Sinhala-English dictionary available for consumption was by malalasekera1967english. However, this dictionary is locked behind copyright laws and is not available for public research and development. This copyright issue is shared with other printed dictionaries [jayathilake1937sinhala, maitipe1988gunasena, weerasinghe1999godage, wijayathunga2003maha, ranaweera2004wasana, gunaratne2006ratna] as well. The dictionary by Madura2018Madura is publicly available for usage through an online web interface but does not provide API access or means to directly access the data set. The largest publicly available English-Sinhala dictionary data set is from a discontinued Firefox plug-in, EnSiTip [wasala2008ensitip], which bears a more than passing resemblance to the above dictionary by Madura2018Madura. hettige2007developing claim to have created a lexicon to help in their attempt to create a system capable of English-to-Sinhala machine translation. A review of the requirements for an English-Sinhala smart bilingual dictionary was conducted by samarawickramarequirements.
There is also a government-sponsored trilingual dictionary [Lang2018Tri], which matches Sinhala, English, and Tamil. However, other than a crude web interface on the ministry website, there is no efficient API or any other way for a researcher to access the data in this dictionary. weerasinghe2013construction have created a multilingual place name database for Sri Lanka which may function both as a dictionary and as a resource for certain NER tasks.
3.4 WordNets
WordNets [miller1995wordnet] are extremely powerful and act as a versatile component of many NLP applications. They encompass a number of linguistic relations which exist between the words in the lexicon of the language, including but not limited to hyponymy, hypernymy, synonymy, and meronymy. Their uses range from simple gazetteer listing applications [wimalasuriya2010ontology] to information extraction based on semantic similarity [wu1994verbs, jiang1997semantic] or semantic oppositeness [de2017discovering]. An attempt has been made to build a Sinhala WordNet by wijesiri2014building. For a time it was hosted on [Sinhala2015], but it too is now defunct and all the data and applications are lost. However, even at its peak, due to the lack of volunteers for the crowd-sourced methodology of populating the WordNet, it was at best an incomplete product. Another effort to build a Sinhala WordNet was initiated by welgama2011towards independently of the above, but it too has stopped progressing, even before achieving the completion level of wijesiri2014building.
3.5 Morphological Analyzers
As shown in Fig 1, morphological analysis is a necessary ground-level component for natural language processing. Given that Sinhala is a highly inflected language [liyanage2012computational, kanduboda2013usage, de2015Sinhala], a proper morphological analysis process is vital. The earliest attempts at Sinhala morphological analysis we have observed are the studies by herath1989sinhalese, herath1990formalization. They are more an analysis of Sinhala morphology than a working tool. As such, we discussed the observations and conclusions of these works in Section 2. It is also worth noting that these works predate the introduction of Sinhala Unicode and thus use a transliteration of Sinhala in the Latin alphabet.
The next attempt, by herath1992analysis, creates a modular unit structure for morphological analysis of Sinhala. Much later, as a step in their efforts to create a system able to do English-to-Sinhala machine translation, hettige2006morphological claim to have created a morphological analyzer (void of any public data or code), which links to their studies of a Sinhala parser [hettige2006parser] and computational grammar [hettige2011computational]. hettige2012multi further propose a multi-agent system for morphological analysis. welgama2013evaluating attempted to evaluate machine learning approaches for Sinhala morphological analysis. Yet another independent attempt to create a morphological parser for Sinhala verbs was carried out by fernando2013morphological. Later, another study, which was restricted to morphological analysis of Sinhala verbs, was conducted by dilshani2017corpus. There was no indication of whether this work was continued to cover other types of words. Further, other than this singular publication, no data or tools were made publicly accessible. nandathilaka2018rule proposed a rule-based approach for Sinhala lemmatizing. An extremely simple plagiarism detection tool which only uses n-grams of simply tokenized text was proposed by basnayakeplagiarism. The work by virajdefining claims to have set a gold standard of definitions for the morphology of Sinhala words; but given that their results are not publicly available, further usage or confirmation of these claims is not possible. Table V provides a comparative summary of the discussion above.
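Table V: Comparative summary of Sinhala morphological analyzers.
Base: Rule-based (RB) / Machine Learning (ML)
Handles (able to handle the part of speech; columns Nu to Fn): Yes (Y) / No (N)
Outputs (columns R to C): Yes (Y) / No (N) / No Information (O)
Abbreviations: Nouns (Nu), Verbs (Ve), Adjectives (Aj), Adverbs (Av), Function Words (Fn), Root (R), Person (P), Number (Nb), Gender (G), Article (A), Case (C)
| Study | Base | Method | Nu | Ve | Aj | Av | Fn | R | P | Nb | G | A | C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hettige2006morphological | RB | Finite State Automata | Y | Y | N | N | N | Y | Y | Y | Y | Y | Y |
| fernando2013morphological | RB | Finite State Transducer | N | Y | N | N | N | Y | Y | Y | Y | N | Y |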
3.6 Part of Speech Taggers
The next step after morphological analysis is Part of Speech (PoS) tagging. PoS tags differ in number and functionality from language to language. Therefore, the first step in creating an effective PoS tagger is to identify the PoS tag set for the language. This work has been accomplished by fernando2016comprehensive and dilshani2017comprehensive. Expanding on that, fernando2016comprehensive introduced an SVM-based PoS tagger for Sinhala, and fernando2018evaluation then gave an evaluation of different classifiers for the task of Sinhala PoS tagging. While it is obvious that there has been some follow-up work after the initial foundation, it seems that all of it has been internal to one research group at one institution, as neither the data nor the tools of any of these findings have been made available for the use of external researchers. Several attempts have been made to create a stochastic PoS tagger for Sinhala, with the studies by herath2004stochastic, jayaweera2012evaluation, and jayasuriya2013learning being the most notable. Within the group which did one of the above stochastic studies [jayaweera2012evaluation], yet another set of studies was carried out to create a Sinhala PoS tagger, starting with the foundation of jayaweera2011part, which was then extended to a Hidden Markov Model (HMM) based approach [jayaweera2014hidden] and an analysis of unknown words [jayaweera2014unknown, jayaweera2014handling]. Further, this group presented a comparison of a few Sinhala PoS taggers that were available to them [jayaweera2016comparison]. A RESTful PoS tagging web service created by jayaweera2015restful using the above research can still be accessed (http://bit.ly/2F0jKid) via POST and GET. A hybrid PoS tagger for the Sinhala language was proposed by gunasekara2016hybrid. The study by kothalawala2019online discussed the data availability problem in NLP with a Sinhala PoS tagging experiment among others.
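To make the HMM-based approach mentioned above more concrete, the following is a minimal, generic Viterbi decoder. It is a textbook sketch and not the implementation of jayaweera2014hidden: the tag names, the probabilities, and the placeholder tokens (`w_noun`, `w_verb`, `w_amb`) are all invented for illustration, and the constant `unk` used for unseen words is a simplifying assumption rather than the unknown-word handling studied by that group.

```python
import math

# Toy model with made-up values; a real tagger would estimate these
# from a tagged corpus and use the Sinhala tag set.
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.8, "VERB": 0.2}
TRANS_P = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
EMIT_P = {
    "NOUN": {"w_noun": 0.9, "w_amb": 0.1},
    "VERB": {"w_verb": 0.8, "w_amb": 0.2},
}


def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for `words` under the HMM."""
    if not words:
        return []

    def emit(tag, word):
        # Unseen (tag, word) pairs receive a small constant probability.
        return math.log(emit_p[tag].get(word, unk))

    # Each row maps a tag to (best log-probability, previous tag).
    rows = [{t: (math.log(start_p[t]) + emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        previous = rows[-1]
        row = {}
        for tag in tags:
            best_prev = max(
                tags, key=lambda p: previous[p][0] + math.log(trans_p[p][tag])
            )
            score = previous[best_prev][0] + math.log(trans_p[best_prev][tag])
            row[tag] = (score + emit(tag, word), best_prev)
        rows.append(row)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: rows[-1][t][0])
    path = [tag]
    for row in reversed(rows[1:]):
        tag = row[tag][1]
        path.append(tag)
    return path[::-1]
```

Working in log space avoids numerical underflow on long sentences; with the toy model, the ambiguous token `w_amb` is tagged according to whichever tag best fits its left context.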
3.7 Parsers
The PoS tagged data then needs to be handed over to a parser. This is an area which is not completely solved even in English due to various inherent ambiguities in natural languages. However, in the case of English, there are systems which provide adequate, if not yet perfect, results [manning2014the]. For Sinhala, the first parser was proposed by hettige2006parser, with a model for grammar [hettige2011computational]. The study by liyanage2012computational concentrates on the same area, given that they have worked on formalizing a computational grammar for Sinhala. While they do report reasonable results, yet again, they do not provide any means for the public to access the data or the tools that they have developed. kanduboda2013usage have worked on Sinhala differential object markers relevant for parsing.
The first attempt at a Sinhala parser, as mentioned above, was by hettige2006parser, where they created a prototype Sinhala morphological analyzer and a parser as part of their larger project to build an end-to-end translator system. The function of the parser is based on three dictionaries: Base Dictionary, Rule Dictionary, and Concept Dictionary. They are built as follows:
The Base Dictionary: prakurthi (base words), nipatha (prepositions), upasarga (prefixes), and vibakthi (Irregular Verbs).
The Rule Dictionary: inflection rules used to generate various forms of verbs and nouns from the base words.
The Concept Dictionary: synonyms and antonyms for the words found in the base dictionary.
Parsers are, in essence, a computational representation of the grammar of a natural language. As such, in building Sinhala parsers, it is crucial to create a computational model for Sinhala grammar. The first such attempt was made by hettige2011computational, with special consideration given to the morphology and the syntax of the Sinhala language, as an extension to their earlier work [hettige2006parser]. Here, it is worth noting that, unlike in their earlier attempt [hettige2006parser], where they explicitly mentioned that they were building a parser, in this study [hettige2011computational] they make the more conservative claim of building a computational grammar. Under morphology, they again handled Sinhala inflection. Their system is based on a Finite State Transducer (FST) and Context-Free Grammar (CFG), where they modeled 85 rules for nouns and 18 rules for verbs. The specific implementation is closer to a rule-based composer than to a parser. It is also worth noting that this system could only handle simple sentences which contained only the following 8 constituents: Attributive Adjunct of Subject, Subject, Attributive Adjunct of Object, Object, Attributive Adjunct of Predicate, Attributive Adjunct of the Complement of Predicate, Complement of Predicate, and Predicate. With these, they propose the following grammar rules for Sinhala:
S = Subject Akkyanaya
Subject = SimpleSubject | ComplexSubject
ComplexSubject = SimpleSubject ConSub
SimpleSubject = Noun | Adjective Noun
ConSub = Conjunction SimpleSubject
Akkyanaya = VerbP | Object VerbP
Object = SimpleObject | ComplexObject
ComplexObject = Conjunction SimpleObject
SimpleObject = Noun | Adjective Noun
VerbP = Verb | Adverb Verb
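The grammar rules above are small enough to be checked mechanically. The following sketch encodes them as a dictionary and tests whether a sequence of PoS tags can be derived from S. It is an illustration under stated assumptions, not the system of hettige2011computational: operating on PoS tags instead of Sinhala words, and the names `GRAMMAR` and `recognises`, are choices made for this example.

```python
from functools import lru_cache

# Each non-terminal maps to its alternative productions, as listed above.
# Symbols that are not keys (Noun, Verb, ...) are treated as PoS tags.
GRAMMAR = {
    "S": [("Subject", "Akkyanaya")],
    "Subject": [("SimpleSubject",), ("ComplexSubject",)],
    "ComplexSubject": [("SimpleSubject", "ConSub")],
    "SimpleSubject": [("Noun",), ("Adjective", "Noun")],
    "ConSub": [("Conjunction", "SimpleSubject")],
    "Akkyanaya": [("VerbP",), ("Object", "VerbP")],
    "Object": [("SimpleObject",), ("ComplexObject",)],
    "ComplexObject": [("Conjunction", "SimpleObject")],
    "SimpleObject": [("Noun",), ("Adjective", "Noun")],
    "VerbP": [("Verb",), ("Adverb", "Verb")],
}


def recognises(tags, start="S"):
    """Return True if the PoS tag sequence can be derived from `start`."""
    tags = tuple(tags)

    @lru_cache(maxsize=None)
    def ends(symbol, i):
        """All positions where `symbol` can end when it begins at `i`."""
        if symbol not in GRAMMAR:  # terminal: must match the tag at i
            matched = i < len(tags) and tags[i] == symbol
            return frozenset({i + 1}) if matched else frozenset()
        result = set()
        for production in GRAMMAR[symbol]:
            positions = {i}
            for part in production:
                positions = {j for p in positions for j in ends(part, p)}
            result |= positions
        return frozenset(result)

    return len(tags) in ends(start, 0)
```

For example, the SOV tag sequence `["Noun", "Adjective", "Noun", "Verb"]` is accepted, whereas the same tags in SVO order are not. The plain top-down strategy is sufficient here because none of the rules is left-recursive.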
The later work by liyanage2012computational also involves formalizing a computational grammar for Sinhala. They claim that, in practice, Sinhala can have any order of words. However, they do not note that this happens because practices of the spoken language, which does not share the strong SOV conventions of the written language, are slowly seeping into written text. They do, however, note that Sinhala grammar is modeled as a head-final language [karunatilaka1997sinhala]. They propose the Sinhala noun phrase to be defined as shown in equation 1, built around a noun which can be one of three types: common noun, pronoun, or proper noun. The adjectival phrase is then defined as shown in equation 2, in terms of a determiner, the adjective, and an optional degree operator which can be used to intensify the meaning of the adjective in cases where the adjective is qualitative. While they note that, according to gunasekara1891comprehensive, there have to be three classes of adjectives (qualitative, quantitative, and demonstrative), they do not implement this distinction in their system. Similarly, they propose the Sinhala verb phrase to be defined as shown in equation 3, built around a single verb. They note here that they ignore compound verbs and auxiliary verbs in their grammar. The adverbial phrases are then recursively defined as shown in equation 4.
Similar to hettige2011computational, the work by liyanage2012computational also builds a CFG for Sinhala, covering 10 out of the 25 types of simple sentence structures in Sinhala reported by Abhayasinghe1998sinhala. This parser is unable to parse sentences where inanimate subjects do not consider the number. Further, sentences which contain compound verbs, auxiliary verbs, present participles, or past participles cannot be handled by this parser. Verbs in the imperative mood or under negation cannot be handled either. Finally, non-verbal sentences which end with adjectives, oblique nominals, locative predicates, adverbials, or any other language entity which is not a verb cannot be handled by this parser.
The study by kanduboda2013usage covers not the whole of Sinhala parsing but analyzes a very specific property of Sinhala observed by aissen2003differential which states that it is possible to notice Differential Object Marking (DOM) in Sinhala active sentences. kanduboda2013usage define this as the choice of /wa/ and /ta/ object markers. They further observe three unique aspects of DOM in Sinhala: (a) it is only observed in active sentences which contain transitive verbs, (b) it can occur with accusative marked nouns but not with any other cases, (c) it exists only if the sentence has placed an animate noun in the accusative position. They do a statistical analysis and provide a number of short gazetteer lists as appendixes. However, they observe that further work has to be done for this particular language rule in Sinhala given that they found some examples which proved to be exceptions to the general model which they proposed.
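The three conditions reported above can be combined into a simple predicate. This is only a sketch of the general model as summarized here: the `Clause` type and its field names are hypothetical, and, as the authors themselves note, exceptions to the general model exist, so a positive result should be read as "DOM may occur" and not as a guarantee.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """Minimal description of a clause for the DOM conditions."""
    voice: str              # e.g. "active" or "passive"
    verb_transitive: bool   # whether the main verb is transitive
    object_case: str        # case marking of the object noun
    object_animate: bool    # whether the object noun is animate


def dom_possible(clause):
    """Apply conditions (a), (b), and (c) of the general model."""
    return (
        clause.voice == "active"                 # (a) active sentence ...
        and clause.verb_transitive               # (a) ... with a transitive verb
        and clause.object_case == "accusative"   # (b) accusative-marked noun only
        and clause.object_animate                # (c) animate noun in that position
    )
```

A clause that fails any one of the conditions, such as an active transitive clause with an inanimate accusative object, is rejected by this check.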
3.8 Named Entity Recognition Tools
Notes to Table VI: * denotes a baseline.
The feature columns denote, in order: Context Words, Word Prefixes and Suffixes, Length of the Word, Frequency of the Word, First Word/Last Word of a Sentence, (PoS) Tags, Gazetteer Lists, Clue Words, Outcome Prior, Previous Map, and Cutoff Value.
As shown in Fig 1, once the text is properly parsed, it has to be processed using a Named Entity Recognition (NER) system. The first attempt at Sinhala NER was made by dahanayaka2014named. Given that they were conducting the first study for Sinhala NER, they based their approach on NER research done for other languages, paying particular attention to work on Indic languages. On that matter, they were the first to make the interesting observation that NER for Indic languages (including, but not limited to, Sinhala) is more difficult than for English by virtue of the absence of capitalization. Following prior work done on other languages, they used Conditional Random Fields (CRF) as their main model and compared it against a baseline Maximum Entropy (ME) model. However, they only use the candidate word, Context Words around the candidate word, and a simple analysis of Sinhala suffixes as their features.
The follow-up work by senevirathne2015conditional kept the CRF model with all the previous features but did not report a comparative analysis with an ME model. The innovation introduced by this work is a richer set of features. In addition to the features used by dahanayaka2014named, they introduced Length of the Word as a threshold feature. They also introduced a First Word feature after observing certain rigid grammatical rules of Sinhala. A Clue Words feature, in the form of a subset of the Context Words feature, was first proposed by this work. Finally, they introduced a Previous Map feature, which is essentially the NE value of the preceding word. Some of these features are extracted with the help of a rule-based post-processor which utilizes context-based word lists.
The third attempt at Sinhala NER was by manamini2016ananya, who dubbed their system Ananya. They inherit the CRF model and ME baseline from the work of dahanayaka2014named. In addition to that, they take the enhanced feature list of senevirathne2015conditional and enrich it further. They introduce a Frequency of the Word feature based on the assumption that the most commonly occurring words are not NEs. Thus, they model this as a Boolean value with a threshold applied on the word frequency. They extend the First Word feature proposed by senevirathne2015conditional to a First Word/Last Word of a Sentence feature, noting that Sinhala grammar is of SOV configuration. They introduce a (PoS) Tag feature and a gazetteer-list-based feature, in line with research done on NER in other languages. They formally introduce Clue Words, which was initially proposed as a sub-feature by senevirathne2015conditional, as an independent feature. Utilizing the fact that they have the ME model, unlike senevirathne2015conditional, they introduce a complementary feature to Previous Map named Outcome Prior, which uses the underlying distribution of the outcomes of the ME model. Finally, they introduce a Cutoff Value feature to handle the over-fitting problem.
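Several of the word-level features discussed above can be illustrated with a short sketch. The following is a minimal illustration, not the implementation of any of the three systems: the function name `token_features`, the thresholds, the padding symbols, and the romanized placeholder tokens are all assumptions made for this example. Features that need external resources (PoS tags, gazetteer lists, Clue Words, Outcome Prior) are left out.

```python
from collections import Counter


def token_features(sentence, i, freq, prev_label,
                   max_affix=3, length_threshold=4, freq_threshold=2):
    """Build a feature dict for the token at position i of a tokenized sentence.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    word = sentence[i]
    feats = {
        "word": word,
        # Context Words: one token to each side, with boundary padding.
        "prev_word": sentence[i - 1] if i > 0 else "<BOS>",
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
        # Length of the Word, modeled as a threshold (Boolean) feature.
        "is_long": len(word) >= length_threshold,
        # Frequency of the Word: frequent words are assumed unlikely to be NEs.
        "is_frequent": freq[word] >= freq_threshold,
        # First Word / Last Word of a Sentence.
        "is_first": i == 0,
        "is_last": i == len(sentence) - 1,
        # Previous Map: the NE label assigned to the preceding token.
        "prev_label": prev_label,
    }
    # Word Prefixes and Suffixes of up to max_affix characters.
    for k in range(1, min(max_affix, len(word)) + 1):
        feats[f"prefix_{k}"] = word[:k]
        feats[f"suffix_{k}"] = word[-k:]
    return feats


# Romanized placeholder tokens and counts, not real annotated data.
sentence = ["kamal", "colombo", "giya"]
freq = Counter(["giya", "giya", "giya", "kamal", "colombo"])
features = token_features(sentence, 0, freq, prev_label="O")
```

A feature dictionary of this kind, computed for each token, is the usual input format for CRF toolkits. Note that slicing by character is only a rough stand-in for affix analysis on Sinhala script, where one written syllable may span several code points.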
Table VI provides a comparative summary of the discussion above. It should be noted that all three of these models only tag NEs of the types person names, location names, and organization names. The Ananya system by manamini2016ananya is available to download at GitHub (http://bit.ly/2XrwCoK). The data and code for the approaches by dahanayaka2014named and by senevirathne2015conditional are not accessible to the public. azeez2020fine proposed a fine-grained NER model for Sinhala, building on their earlier work on NER [manamini2016ananya] and PoS tagging [fernando2018evaluation].
3.9 Semantic Tools
Applications of the semantic layer are more advanced than the ones below it in Figure 1. But even with the obvious lack of resources and tools, a number of attempts have been made on semantic level applications for the Sinhala language. The earliest attempt at semantic analysis was made by herath1990syntactic using their earlier work which dealt with Sinhala morphological analysis [herath1989sinhalese]. A Sinhala semantic similarity measure has been developed for short sentences by kadupitiya2016sinhala. This work was then extended by kadupitiya2017sinhala for the application use case of short answer grading. Data and tools for these projects are not publicly available. A deterministic process flow for automatic Sinhala text summarization was proposed by welgama2012automatic.
There have been multiple attempts to do word sense disambiguation (WSD) [yarowsky1992word, ide1998introduction, yarowsky1995unsupervised, banerjee2002adapted, navigli2009word] for Sinhala. For this, arukgoda2014word have proposed a system based on the Lesk Algorithm [lesk1986Automatic] while marasinghe2002word have proposed a system based on probabilistic modeling. A dialogue act recognition system which utilizes simple classification algorithms has been proposed by palihakkara2015dialogue.
Text classification is a popular application on the semantic layer of the NLP stack. A very basic Sinhala text classification using a Naïve Bayes classifier, Zipf’s Law behavior, and SVMs was attempted by gallege2010analysis. A smaller implementation of Sinhala news classification has been attempted by de2015Sinhala. As mentioned in Section 3.2, their news corpus is publicly available (https://osf.io/tdb84/). A word2vec based tool (http://bit.ly/2QKI9Np) for sentiment analysis of Sinhala news comments is available. Another attempt at Sinhala text classification using six popular rule based algorithms was made by lakmali2017effectiveness. Even though they talk about building a corpus named SinNG5, they do not indicate any means for others to obtain it. Another study by kumari2019use utilizes the SinNG5 corpus as the data set for their attempt to use LIME [ribeiro2016should] for human interpretability of Sinhala document classification. However, they too do not provide access to the corpus. A methodology for constructing a sentiment lexicon for the Sinhala language in a semi-automated manner based on a given corpus was proposed by chathuranga2019sinhala. nanayakkara2018clustering have implemented a system which uses corpus-based similarity measures for Sinhala text classification. gunasekara2018context claim to have created a context aware stop word extraction method for Sinhala text classification based on simple TF-IDF. An LSTM based textual entailment system for Sinhala was proposed by jayasinghe2019deep. demotte2020sentiment proposed a sentiment analysis system based on sentence-state LSTM networks for Sinhala news comments.
3.10 Phonological Tools
In the case of the phonological layer, a report on Sinhala phonetics and phonology was published by wasala2005research. wickramasinghe2007practical discussed the practical issues in developing Sinhala Text-to-Speech and Speech Recognition systems. Based on the earlier work by weerasinghe2005rule, wasala2006sinhala have developed methods for Sinhala grapheme-to-phoneme conversion along with a set of rules for schwa epenthesis. This work was then extended by nadungodagesinhala. weerasinghe2007festival developed a Sinhala text-to-speech system. However, it is not publicly accessible. They internally extended it to create a system capable of helping a mute person achieve synthesized real-time interactive voice communication in Sinhala [amarasekara2013real]. A rule based approach for automatic segmentation of a small set of Sinhala text into syllables was proposed by kumara2007automatic. A new prosodic phrasing method to help with the Sinhala Text-to-Speech process was proposed by bandara2017ew, dias2009sinhala, bandara2013new. sodimana2018text proposed a text normalization methodology for Sinhala text-to-speech systems. Further, sodimana2018step formalized a step-by-step process for building text-to-speech voices for Sinhala. A separate group has worked on Sinhala text-to-speech systems independently of the above [nanayakkarahuman].
Conversely, nadungodagespeech have done a series of work on Sinhala speech recognition with special attention given to Sinhala being a resource-poor language. This project divides its focus among continuity [nadungodage2011continuous], active learning [nadungodage2013efficient], and speaker adaptation [nadungodage2015speaker]. A speaker-independent Sinhala speech recognition system for voice dialing was proposed by amarasingha2012speaker and, on the other end, a Sinhala speech recognition methodology for interactive voice response systems accessed through mobile phones was proposed by manamperi2018sinhala. priyadarshani2012speaker proposes a method for speaker-dependent speech recognition based on their previous work on dynamic time warping for recognizing isolated Sinhala words [priyadarshani2012dynamic], genetic algorithms [priyadarshani2012genetic], and a syllable segmentation method utilizing acoustic envelopes [priyadarshani2011automatic]. The method proposed by gunasekara2015real utilizes an HMM model for Sinhala speech-to-text. A Sinhala speech recognizer supporting bi-directional conversion between Unicode Sinhala and phonetic English was proposed by punchimudiyanse2015unicode. The work by karunanayake2019transfer applies transfer learning to CNNs to transcribe free-form Sinhala and Tamil speech data sets for the purpose of classification.
The Sinhala speech classification system proposed by buddhika2018domain does so without converting the speech to text. However, they report that this approach only works for specific domains with well-defined limited vocabularies. kavmini2020improved presented a Sinhala speech command classification system which can be used for downstream applications.
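Dynamic time warping, mentioned above in connection with isolated word recognition, can be sketched in a few lines. This is a textbook formulation given for illustration and is not taken from priyadarshani2012dynamic: the function names, the template labels, and the numeric sequences are hypothetical, and each frame is a single number here whereas a real recognizer would compare acoustic feature vectors.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def recognize(sample, templates):
    """Return the label of the stored template closest to the sample under DTW."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))


# Hypothetical per-frame values for two stored word templates.
templates = {"word_a": [1, 3, 1], "word_b": [5, 5, 2]}
# A slower utterance of the first template: same shape, stretched in time.
sample = [1, 1, 3, 3, 1]
label = recognize(sample, templates)
```

Because the warping path absorbs differences in speaking rate, a stretched utterance still matches its template with zero cost in this toy example. This template-matching design suits small vocabularies and speaker-dependent settings, which is consistent with the scope of the work cited above.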
3.11 Optical Character Recognition Applications
While it is not necessarily a component of the NLP stack shown in Fig 1, which follows the definition by liddy2001natural, it is possible to swap out the bottom most phonological layer of the stack in favour of an Optical Character Recognition (OCR) and text rendering layer.
The first attempt at a Sinhala OCR system was made by rajapakse1995neural, before any other work had been done on the topic. Much later, a linear symmetry based approach was proposed by premaratne2002recognition, premaratne2004segmentation. They then used hidden Markov model-based optimization on the recognized Sinhala script [premaratne2006lexicon]. Similarly, hewavitharana2002off used hidden Markov models for off-line Sinhala character recognition. Statistical approaches with histogram projections for Sinhala character recognition were proposed by hewavitharana2002statistical, by ajward2010converting, and by madushanka2017sinhala. karunanayaka2004off also did off-line Sinhala character recognition with a use case for postal city name recognition. A separate group had attempted Sinhala OCR [weerasinghe2008nlp] mainly involving the nearest-neighbor method [weerasinghe2006nearest]. A study by ediriweera2012improviing uses dictionaries to correct errors in Sinhala OCR. An early attempt at Sinhala OCR by dias2013sinhala has been extended to be online and made available for use via desktops [dias2013online] and hand-held devices [ranmuthugala2006online] with the ability to recognize handwriting. A simple neural network based approach for Sinhala OCR was utilized by rimas2013optical. A fuzzy-based model for identifying printed Sinhala characters was proposed by gunarathna2014fuzzy. premachandra2016artificial proposes a simple back-propagation artificial neural network with hand crafted features for Sinhala character recognition. Another neural network with specialized feature extraction for Sinhala character recognition was proposed by naleer2016technique. On the matter of neural networks and feature extraction, a feature selection process for Sinhala OCR was proposed by kumara2016systematic. jayawickrama2018letter worked on Sinhala printed characters with special focus on handling diacritic vowels. However, they opted to refer to diacritic vowels as modifiers in their work.
gunawardhana2018segmentation proposed a limited approach to recognize Sinhala letters on Facebook images.
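The histogram projection idea used by several of the printed-text approaches above can be sketched as follows. This is a generic illustration and not the method of any cited paper: the function names and the tiny binary image are hypothetical, and the image is assumed to be already binarized with 1 marking ink.

```python
def vertical_projection(image):
    """Count ink pixels (1s) in each column of a binary image given as rows of 0/1."""
    return [sum(column) for column in zip(*image)]


def segment_columns(image):
    """Return (start, end) column spans, end exclusive, separated by empty columns."""
    spans, start = [], None
    profile = vertical_projection(image)
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                    # a glyph begins at the first inked column
        elif count == 0 and start is not None:
            spans.append((start, x))     # an empty column closes the glyph
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


# Hypothetical 3x5 binary image holding two glyphs separated by one empty column.
image = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 0, 1],
]
profile = vertical_projection(image)
spans = segment_columns(image)
```

Splitting at empty columns assumes that neighboring glyphs never touch or overlap, which is one reason handwriting and diacritic vowels need the more specialized treatment described in the work cited in this section.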
fernando2003database claim to have created a database for handwriting recognition research in the Sinhala language and further claim that the data set is available at the National Science Foundation (NSF) of Sri Lanka. However, the paper provides no URLs and we were not able to find the data set on the NSF website either. The work by karunanayaka2005thresholding is focused on noise reduction and skew correction of Sinhala handwritten words. A genetic algorithm-based approach for non-cursive Sinhala handwritten script recognition was proposed by jayasekara2005non. nilaweera2007comparison compare projection and wavelet-based techniques for recognizing handwritten Sinhala script. silva2014segmenting worked on segmenting Sinhala handwritten characters with special focus on handling diacritic vowels. A comparative study of a few available Sinhala handwriting recognition methods was done by silva2014state. silva2015contour uses contour tracing for isolated characters in handwritten Sinhala text. A Sinhala handwriting OCR system which utilizes zone-based feature extraction has been proposed by dharmapala2017sinhala. The study by walawage2018segmentation specifically focuses on segmentation of overlapping and touching Sinhala handwritten characters.
Summarization of optically recognized old Sinhala text for the purpose of archival search and preservation was explored by rathnasena2018summarization. The work of peiris2012recognition also focused on OCR for ancient Sinhala inscriptions. A neural network based method for recognizing ancient Sinhala inscriptions was proposed by karunarathne2017recognizing. chanda2008word proposed a Gaussian kernel SVM based method for word-wise Sinhala, Tamil, and English script identification.
A series of work has been done by one group towards English to Sinhala translation, as mentioned in some of the above subsections. This work includes building a morphological analyzer [hettige2006morphological], lexicon databases [hettige2007developing], a transliteration system [hettige2007transliteration], an evaluation model [hettige2010evaluation], a computational model of grammar [hettige2011computational], and a multi-agent solution [hettige2016multi]. After working on human-assisted machine translation [hettige2007using], hettige2009theoretical, hettige2010varanageema have attempted to establish a theoretical basis for English to Sinhala machine translation. A very simplistic web based translator was proposed [hettige2008web, hettige2008web1]. The same group has worked on a Sinhala ontology generator for the purpose of machine translation [hettige2014sinhala] and a phrase level translator [hettige2017phrase] based on the previous work on a multi-agent system for translation [hettige2013masmt]. Further, an application of the English to Sinhala translator to the use case of selected text for reading was implemented [hettige2013selected].
Another group independently attempted English-to-Sinhala machine translation [liyanapathirana2011english] with a statistical approach [liyanapathirana2013statistical]. wijerathna2012translator and de2008sinhala have proposed simple rule based translators. An example-based method applied on the English-Sinhala sentence aligned government domain corpus was proposed by silva2008example. A translator based on a look-up system was proposed by vidanaralage2018sinhala. In a preprint, joseph2019evolutionary proposes an evolutionary algorithm for Sinhala to English translation with a basis of Point-wise Mutual Information (PMI) and claims that the code will be shared once the paper is accepted. However, they do not report any quantitative results to be compared and the reported qualitative results are also superficial.
Most of the cross Sinhala and Tamil work has been done in the domain of machine translation. A neural machine translation system for the Sinhala and Tamil languages was initiated by tennage2017neural, tennage2017neural1. They then further enhanced it with transliteration and byte pair encoding [tennage2018transliteration] and used synthetic training data to handle the rare word problem [tennage2018handling]. This project produced Si-Ta [ranathunga2018si], a machine translation system for Sinhala and Tamil official documents. On the statistical machine translation front, farhath2018integration worked on integrating bilingual lists. The attempts by weerasinghe2003statistical and sripirakas2010statistical were also focused on statistical machine translation, while jeyakaran2013novel attempted a kernel regression method. Yet another attempt was made by pushpananda2013towards, which they later extended with some quality improvements [pushpananda2014sinhala]. An attempt at real-time direct translation between Sinhala and Tamil was made by rajpirathap2015real. dilshani2018linguistic have done a study on the linguistic divergence of the Sinhala and Tamil languages with respect to machine translation. mokanarangan2019translation claims to have built a named entity translator between Sinhala and Tamil for official government documents. But this work is locked behind an institutional repository wall and thus is not accessible to other researchers. arukgoda2019improving studied the possibility of using deep learning techniques to improve Sinhala-Tamil translation. While not related to Tamil, there have been attempts to link Sinhala NLP with Japanese by Herath et al. [herath1994practical, herath1996bunsetsu, herath1993generation], thelijjagoda2004japanese, and kanduboda2011role.
There has been an attempt to use dictionary-based machine translation [shalini2017dictionary] between Sinhala and the liturgical language of Buddhism, Pali [childers1875dictionary, salaville1938introduction, liddicoat1993choosing].
3.13 Miscellaneous Applications
In this section, we discuss NLP tools and research which are either hard to categorize under the above sections or reasonably span several of them. The first miscellaneous application of Sinhala NLP is spell checking. The open-source data driven approach proposed by wasala2010data, wasala2011open claims to be able to check and correct spelling errors in Sinhala. The approach by jayalatharachchi2012data attempts to obtain synergy between two algorithms for the same purpose. These efforts [wasala2010data, jayalatharachchi2012data] were then extended by subhagya2018data.
On the matter of Sinhala sign language, strides have been made in the domains of computer interpreting for written Sinhala [punchimudiyanse2017computer] and animation of finger-spelled words and number signs [punchimudiyanse2017animation]. A simple Sinhala chat bot which utilizes a small knowledge base has been proposed by hettige2006first. fernando2011inexact proposed a method for inexact matching of Sinhala proper names. A study on determining canonical word order of colloquial Sinhala sentences using priority information was conducted by kanduboda2009priority which they later extended [kanduboda2010priority, tamaoka2011effects, kanduboda2012priority]. jayakody2016mahoshadha uses simple KNN and SVM methods on a PoS tagged Sinhala corpus to create a question-answering system which they name Mahoshadha. sandaruwan2020identification have attempted to identify abusive Sinhala comments in social media using text mining and machine learning techniques.
4 Primary Sources
Even though the main objective of this survey is to cover NLP tools and research, we noticed that many of these NLP tools and research efforts depend on primary sources of the Sinhala language, such as printed books, as knowledge sources and ground truth. Therefore, for the benefit of other researchers who venture into Sinhala NLP, we decided to add a short introduction to the available primary sources of the Sinhala language used by their peers. We note that the body of work by a single scholar, Disanayake [disanayaka1976national, disanayaka2000basaka, disanayaka2004basaka, disanayake2014sinhala, disanayaka2000basaka1, disanayaka2008basaka, disanayake2001basaka, disanayaka1991structure, disanayaka1985say, disanayaka2006sinhala, disanayaka2007usage, disanayaka1969Bashavaka, disanayaka1995grammar], is used particularly often in NLP applications. For a formal introduction to the language, the books by disanayaka1976national and perera1985sinhala are commonly used. In cases which deal with the Sinhala alphabet, the introductions by indrasena2001sinhala and by Disanayake [disanayaka2000basaka, disanayaka2004basaka] have been used. An analysis of modern Sinhala linguistics has been done by jayathilake1991modern and by pallatthara1966sinhala. The early study by henadeerage2002topics covers a number of topics on the Sinhala language such as grammatical relations, argument structure, phrase structure, and focus constructions.
As we discussed in Section 3.3, a number of printed Sinhala dictionaries exist, malalasekera1967english being the most prominent English-Sinhala dictionary among them. In addition to that seminal work, previous researchers of Sinhala NLP have utilized a number of other dictionaries of various configurations such as: English-Sinhala [gunaratne2006ratna, maitipe1988gunasena, ranaweera2004wasana], Sinhala-Sinhala [jayathilake1937sinhala, wijayathunga2003maha], and English-Sinhala-Tamil [weerasinghe1999godage].
A number of NLP applications have utilized primary sources intended to teach children [dasanayaka1990kumara, dasanayaka2005kumara, fernando1994wara, fernando1994kriya, fernando1994sinhala] or foreigners [ranawake1986spoken, gunasekara1891comprehensive, gunasekara1999comprehensive] (note that [gunasekara1999comprehensive] is an extension of [gunasekara1891comprehensive]). This makes sense given that an introduction written for children would start with basic principles and thus be ideal for crafting rule based NLP systems, and an introduction written for foreigners would describe the Sinhala language in terms of English, easing the process of rule based translation of English NLP tools to Sinhala.
For applications where a rule based approach for Sinhala spelling correction is utilized, the books by disanayaka2006sinhala, disanayaka2007usage, by koparahewa2006dictionary, and by gair2006sinhala are used to provide a basis. A number of NLP applications which handle spoken Sinhala, as phonological layer (Section 3.10) applications or otherwise, note that spoken Sinhala differs considerably from written Sinhala. As such, they refer to primary sources which explicitly deal with spoken Sinhala [ranawake1986spoken, disanayaka1991structure, disanayaka1985say, karunatillake1990introduction, inman1986duration, fernando1994wara, disanayaka1995grammar].
Primary sources used in NLP applications for Sinhala grammar are varied. A number of them provide overviews of the entirety of Sinhala grammar [munidasa1938vyakarana, pallatthara1966sinhala, gunasekara1986comprehensive, nie1989sinhala, jayathilake1991nuthana, sannasgala1995viyakaranavimansawa, balagalle1995bashaadauanayasaha, karunatilaka1997sinhala, karunatilaka2004sinhala, karunarathna2004sinahala, alwis2006niwaeradi, alwis2007niwaeradi, pereraPrayogika, disanayaka1969Bashavaka]. There are specific primary sources focusing on verbs [munidasa1993kriya, fernando1994kriya, disanayake2001basaka], nouns [fernando1994sinhala, disanayaka2008basaka], prepositions [fernando1994wara], compounds [disanayaka2000basaka1], derivation [disanayake2014sinhala], the case system [jayawardana1989surface], and sentence structure [Abhayasinghe1998sinhala] of the Sinhala language. The book by rajapaksha2008sinhala is commonly used in NLP applications as a guide for word tagging and punctuation mark handling. NLP studies that tackle the hard problem of handling questions expressed in Sinhala often refer to the book by kariyakarawana1998syntax. kekulawala1972future has aptly discussed the highly controversial question of the future tense in Sinhala.
At this point, a reader might think that there seems to be a significant number of implementations of NLP for Sinhala, and ask how one can then justify listing Sinhala as a resource-poor language. The important point missing from that assumption is that, for almost all of the implementations and findings listed above, the only thing publicly available to a researcher is a set of research papers. The corpora, tools, algorithms, and anything else discovered through this research are either locked away as the property of individual research groups or, worse, lost to time with crashed servers, lost hard drives, and expired web hosts. This, and probably academic/research rivalry, has caused these separate research groups not to cite or build upon each other's work. In many cases where similar work is done, it is a re-hashing of the same ideas adopted from resource-rich languages, because work done by another group is unavailable or because of a reluctance to refer to and build on it. This has resulted in multiple groups building multiple foundations behind closed doors, with no one ending up with a complete end-to-end NLP workflow. In conclusion, even though there are islands of implementations for Sinhala NLP, they are of very small scale and/or are usually not readily accessible for further use and research by other researchers. Thus, so far, sadly, Sinhala remains a resource-poor language.
The authors would like to thank Romain Egele for checking the examples we have provided in French for their accuracy. Similarly, the authors would also like to thank Shravan Kale for checking the examples we have provided in Hindi for their accuracy.