Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

  • 2019-07-05 06:33:19
  • Nisansa de Silva

Nisansa de Silva
Nisansa de Silva is with the Department of Computer Science & Engineering, University of Moratuwa. E-mail: [email protected]
Manuscript revised July 8, 2019.
Abstract

Sinhala is the native language of the Sinhalese people, who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning Indo-European language tree. However, due to poverty in both linguistic and economic capital, Sinhala, from the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers that a language such as Chinese has. A number of research groups from Sri Lanka have noticed this lack and the dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap with a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the researchers working in this field can better utilize the contributions of their peers. As such, we shall be uploading this paper to arXiv and updating it periodically to reflect the advances made in the field.

Sinhala, Natural Language Processing, Resource Poor Language

1 Introduction

The Sinhala language, being the native language of the Sinhalese people [1], who make up the largest ethnic group of the island country of Sri Lanka, is reported as the mother tongue of approximately 16 million people [2]. To give a brief linguistic background for the purpose of aligning the Sinhala language with the baseline of English, it should first be noted that Sinhala belongs to the same Indo-European language tree [3]. However, unlike English, which is part of the Germanic branch, Sinhala belongs to the Indo-Aryan branch. Further, Sinhala, unlike English, which borrowed the Latin alphabet, has its own writing system, which is a descendant of the Indian Brahmi script [4, 5, 6, 7, 8, 9]. By extension, this makes the Sinhala script a member of the Aramaic family of scripts [10, 11]. It should be noted that the modern Sinhala language has loanwords from languages such as Tamil, English, Portuguese, and Dutch due to various historical reasons. Despite a rich body of literature spanning several millennia (starting between the 3rd and 2nd century BCE [12]), modern natural language processing tools for the Sinhala language are scarce [13].

Natural Language Processing (NLP) is a broad area covering all computational processing and analysis of human languages. To achieve this end, NLP systems operate at different levels [14, 15]. A graphical representation of NLP layers and application domains is shown in Figure 1. On one hand, according to Liddy [15], these systems can be categorized into the following layers: phonological, morphological, lexical, syntactic, semantic, discourse, and pragmatic. The phonological layer deals with the interpretation of language sounds. As such, it consists mainly of speech-to-text and text-to-speech systems. In cases where one is working with written text of the language rather than speech, it is possible to replace this layer with tools which handle Optical Character Recognition (OCR) and language rendering standards (such as Unicode [16]). The morphological layer analyses words at their smallest units of meaning. As such, analysis of word lemmas and prefix-suffix-based inflection is handled in this layer. The lexical layer handles individual words. Therefore, tasks such as Part of Speech (PoS) tagging happen here. The next layer, syntactic, operates at the phrase and sentence level, where grammatical structures are utilized to obtain meaning. The semantic layer attempts to derive meaning from the word level to the sentence level, starting with Named Entity Recognition (NER) at the word level and working its way up by identifying the contexts they are set in until arriving at the overall meaning. The discourse layer handles meaning in textual units larger than a sentence. Here, the function of a particular sentence may be contextualized within the document it is set in. Finally, the pragmatic layer handles context read into content without having to be explicitly mentioned [14, 15]. Some forms of anaphora (co-reference) resolution fall into this layer.

Fig. 1: NLP layers and tasks [14]

On the other hand, Wimalasuriya and Dou [17] categorize NLP tools and research by utility. They introduce three categories of increasing complexity: Information Retrieval (IR), Information Extraction (IE), and Natural Language Understanding (NLU). Information Retrieval covers applications which search and retrieve information relevant to a given query. For pure IR, tools and methods up to and including the syntactic layer in the above analysis are used. Information Extraction, on the other hand, extracts structured information. The difference between IR and IE is that IR does not change the structure of the documents in question. Be they structured, semi-structured, or unstructured, all IR does is fetch them as they are. In comparison, IE takes semi-structured or unstructured text and puts it in a machine readable structure. For this, IE utilizes all the layers used by IR as well as the semantic layer. Natural Language Understanding is purely the idea of cognition. Most NLU tasks fall under the AI-hard category and remain unsolved [14]. However, with varying accuracy, some NLU tasks such as machine translation are being attempted (this is, however, not without the criticism of being nothing more than a Chinese room [18] rather than true NLU). The pragmatic layer of the above analysis belongs to the NLU tasks while the discourse layer straddles information extraction and natural language understanding [14].
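The relationship between the two categorizations can be written down as a small lookup. The following is only an illustrative sketch of the description above: the names `LAYERS`, `TOP_LAYER`, `layers_for`, and `lowest_category` are hypothetical, and placing the discourse layer wholly under NLU is a simplification, since the text says it straddles IE and NLU.

```python
# Layers in the order given in the text (after Liddy [15]).
LAYERS = ["phonological", "morphological", "lexical", "syntactic",
          "semantic", "discourse", "pragmatic"]

# Highest layer each category relies on, following the description above.
# Simplification: the discourse layer straddles IE and NLU; here it is only
# reachable through NLU.
TOP_LAYER = {"IR": "syntactic", "IE": "semantic", "NLU": "pragmatic"}


def layers_for(category):
    """Return the layers a category draws on, from the bottom of the stack up."""
    top = LAYERS.index(TOP_LAYER[category])
    return LAYERS[: top + 1]


def lowest_category(layer):
    """Return the simplest category whose stack already includes the layer."""
    for category in ("IR", "IE", "NLU"):
        if layer in layers_for(category):
            return category
    raise ValueError(f"unknown layer: {layer}")
```

For instance, `layers_for("IE")` lists everything IR uses plus the semantic layer, which mirrors the statement that IE builds on the IR layers.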

The objective of this paper is to serve as a comprehensive survey on the state of natural language processing resources for the Sinhala language. The initial structure and content of this survey are heavily influenced by the preliminary surveys carried out by de Silva [13] and Wijeratne et al. [14]. However, our hope is to host this survey at arXiv as a perpetually evolving work which is continuously updated as new research and tools for the Sinhala language are created and made publicly available. Hence, it is our hope that this work will help future researchers who are engaged in Sinhala NLP research to conduct their literature surveys efficiently and comprehensively. For the success of this survey, we shall also consider the Sri Lankan NLP tools repository, lknlp (https://github.com/lknlp/lknlp.github.io).

The remainder of this survey is organized as follows: Section 2 discusses the various tools and research available for Sinhala NLP. In that section we discuss both pure Sinhala NLP tools and research as well as hybrid Sinhala-English work. We also discuss research and tools which contribute to Sinhala NLP either along with or with the help of Tamil, the other official language of Sri Lanka. Finally, Section 3 concludes the survey.

2 Sinhala NLP resources

In this section we generally follow the structure shown in Figure 1 for sectioning. In addition, we also discuss topics such as available corpora, other data sets, dictionaries, and WordNets. We focus on NLP tools and research rather than the mechanics of language script handling [19, 20, 21, 22, 23, 24]. One of the earliest attempts at Sinhala NLP was made by Herath et al. [25]. However, progress on that project was minimal due to the limitations of its time. The later work by Nandasara [26] did not capture many of the advances made up to the time of its publication. Given that it was a decade old by the time the first edition of this survey was compiled, we observe the existence of many new discoveries in Sinhala NLP which it has not taken into account. A review of some challenges and opportunities in using Sinhala in computer science was done by Nandasara and Mikami [27]. At this point, it is worth noting that the largest number of studies in Sinhala NLP has been on optical character recognition (OCR) rather than on higher levels of the hierarchy shown in Figure 1. On the other hand, the most prolific single project of Sinhala NLP we have observed so far is an attempt to create an end-to-end Sinhala-to-English translator [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45].

Tamil, the other official language of Sri Lanka, is also a resource poor language. However, due to the existence of larger populations of Tamil speakers worldwide, including but not limited to economic powerhouses such as India, there are more research and tools available for Tamil NLP tasks [14]. Therefore, it is reasonable to expect that Sinhala and Tamil NLP endeavours can help each other. In particular, the fact that both are official languages of Sri Lanka results in the generation of parallel data sets in the form of official government documents and local news items. A number of researchers make use of this opportunity. We shall be discussing those applications in this paper as well. Further, there have been some fringe implementations which bridge Sinhala with other languages such as Japanese [46, 47, 48].

2.1 Corpora

For any language, the key for NLP applications and implementations is the existence of adequate corpora. On this matter, a relatively substantial Sinhala text corpus (https://osf.io/a5quv/) was created by Upeksha et al. [49, 50] by web crawling. Later, a smaller Sinhala news corpus (https://osf.io/tdb84/) was created by de Silva [13]. Both of the above corpora are publicly available. However, neither comes close to the massive capacity and range of the existing English corpora. A word corpus of approximately 35,000 entries was developed by Weerasinghe et al. [51], but it does not seem to be online anymore. A number of Sinhala-English parallel corpora were introduced by Guzmán et al. [52]. These include 600k+ Sinhala-English subtitle pairs (http://bit.ly/2KsFQxm) initially collected by [53], and 45k+ Sinhala-English sentence pairs from GNOME (http://bit.ly/2Z8q0fo), KDE (http://bit.ly/2WLY6bI), and Ubuntu (http://bit.ly/2wLVZGt). Guzmán et al. [52] further provided two monolingual corpora for Sinhala: 155k+ sentences of filtered Sinhala Wikipedia (http://bit.ly/2EQZ7oM) and 5178k+ sentences of Sinhala common crawl (http://bit.ly/2ZaQFZo).

As for Sinhala-Tamil corpora, Hameed et al. [54] claim to have built a sentence aligned Sinhala-Tamil parallel corpus and Mohamed et al. [55] claim to have built a word aligned Sinhala-Tamil parallel corpus. However, at the time of writing this paper, neither of them was publicly available. A very small Sinhala-Tamil aligned parallel corpus created by Farhath et al. [56] using order papers of the government of Sri Lanka is available to download (http://bit.ly/2HTMEme).

2.2 Data Sets

Specific data sets for Sinhala, as expected, are scarce. However, a Sinhala PoS tagged data set [57, 58, 59] is available to download from GitHub (http://bit.ly/2Krhrbv). Further, a Sinhala NER data set created by Manamini et al. [60] is also available to download from GitHub (http://bit.ly/2XrwCoK).

Facebook has released FastText [61, 62, 63] models for the Sinhala language trained using the Wikipedia corpus. They are available as both text models (http://bit.ly/2JXAyL8) and binary files (http://bit.ly/2JY5J9c). Using the above models by Facebook, a group at the University of Moratuwa has created an extended FastText model trained on Wikipedia, news, and official government documents. The binary file (http://bit.ly/2WowH0h) of the trained model is available to download. Herath et al. [64] have compiled a report on the Sinhala lexicon for the purpose of establishing a basis for NLP applications.
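As a rough illustration of how a text-format embedding model of this kind can be consumed, the sketch below parses the usual FastText `.vec` layout (a header line with the word count and dimension, then one word and its vector per line) and compares two words by cosine similarity. The in-memory data and the tokens `word_a`, `word_b`, and `word_c` are placeholders rather than entries from the released Sinhala model, and the function names are hypothetical.

```python
import io
import math


def load_vec(handle):
    """Parse the textual .vec layout: a 'count dim' header, then one word per line."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a downloaded file; a real model would be opened with
# open(path, encoding="utf-8") instead.
toy = io.StringIO("3 2\nword_a 1.0 0.0\nword_b 0.8 0.6\nword_c 0.0 1.0\n")
vectors = load_vec(toy)
similarity = cosine(vectors["word_a"], vectors["word_b"])
```

Reading the text format directly avoids any dependency on a particular library, at the cost of holding every vector in memory as Python lists.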

2.3 Dictionaries

A necessary component for the purpose of bridging Sinhala and English resources is an English-Sinhala dictionary. The earliest and most extensive Sinhala-English dictionary available for consumption was by Malalasekera [65]. However, this dictionary is locked behind copyright laws and is not available for public research and development. The dictionary by Kulatunga [66] is publicly available for usage through an online web interface but does not provide API access or means to directly access the data set. The largest publicly available English-Sinhala dictionary data set is from a discontinued Firefox plug-in, EnSiTip [67], which bears a more than passing resemblance to the above dictionary by Kulatunga [66]. Hettige and Karunananda [31] claim to have created a lexicon to help in their attempt to create a system capable of English-to-Sinhala machine translation. A review of the requirements for an English-Sinhala smart bilingual dictionary was conducted by Samarawickrama and Hettige [68].

There exists the government sponsored trilingual dictionary [69], which matches Sinhala, English, and Tamil. However, other than a crude web interface on the ministry website, there is no efficient API or any other way for a researcher to access the data on this dictionary. Weerasinghe and Dias [70] have created a multilingual place name database for Sri Lanka which may function both as a dictionary and a resource for certain NER tasks.

2.4 WordNets

WordNets [71] are extremely powerful and act as a versatile component of many NLP applications. They encompass a number of linguistic properties which exist between the words in the lexicon of the language including but not limited to: hyponymy, hypernymy, synonymy, and meronymy. Their uses range from simple gazetteer listing applications [17] to information extraction based on semantic similarity [72, 73] or semantic oppositeness [74]. An attempt has been made to build a Sinhala WordNet by Wijesiri et al. [75]. For a time it was hosted on [76], but it is now defunct and all the data and applications are lost. However, even at its peak, due to the lack of volunteers for the crowd-sourced methodology of populating the WordNet, it was at best an incomplete product. Another effort to build a Sinhala WordNet was initiated by Welgama et al. [77] independently of the above, but it too has stopped progressing, even before reaching the completion level of Wijesiri et al. [75].

2.5 Morphological Analyzers

As shown in Fig 1, morphological analysis is a ground level necessary component for natural language processing. Given that Sinhala is a highly inflected language [78, 79, 13], a proper morphological analysis process is vital. The earliest attempt at Sinhala morphological analysis we have observed is the study by Herath et al. [80]. The next attempt by Herath et al. [81] creates a modular unit structure for morphological analysis of Sinhala. Much later, as a step in their efforts to create a system with the ability to do English-to-Sinhala machine translation, Hettige and Karunananda [28] claim to have created a morphological analyzer, though no data or code has been made public. Hettige et al. [41] further propose a multi-agent system for morphological analysis. Welgama et al. [82] attempted to evaluate machine learning approaches for Sinhala morphological analysis. Yet another independent attempt to create a morphological parser for Sinhala verbs was carried out by Fernando and Weerasinghe [83]. Later, another study, which was restricted to morphological analysis of Sinhala verbs, was conducted by Dilshani and Dias [84]. There was no indication of whether this work was continued to cover other types of words. Further, other than this singular publication, no data or tools were made publicly accessible. Nandathilaka et al. [85] proposed a rule based approach for Sinhala lemmatizing. An extremely simple plagiarism detection tool which only uses n-grams of simply tokenized text was proposed by Basnayake et al. [86]. The work by Viraj et al. [87] claims to have set a gold standard of definitions for the morphology of Sinhala words; but given that their results are not publicly available, further usage or confirmation of these claims cannot be done.
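To make the idea of comparing n-grams of simply tokenized text concrete, the following is a minimal sketch. It is not the tool of Basnayake et al. [86], whose exact similarity measure is not described here; the use of whitespace tokenization, word trigrams, and Jaccard overlap are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two texts (0.0 if either is empty)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Placeholder tokens stand in for Sinhala words.
score = ngram_overlap("w1 w2 w3 w4 w5", "w1 w2 w3 w4 w6")
```

Because Sinhala is highly inflected, two passages can share little surface n-gram overlap while still being closely related, which is one reason such a measure benefits from lemmatization beforehand.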

2.6 Part of Speech Taggers

The next step after morphological analysis is Part of Speech (PoS) tagging. The PoS tags differ in number and functionality from language to language. Therefore, the first step in creating an effective PoS tagger is to identify the PoS tag set for the language. This work has been accomplished by Fernando et al. [57] and Dilshani et al. [58]. Expanding on that, Fernando et al. [57] have introduced an SVM based PoS tagger for Sinhala, and Fernando and Ranathunga [59] then give an evaluation of different classifiers for the task of Sinhala PoS tagging. While it is obvious that there has been some follow up work after the initial foundation, it seems all of it has been internal to one research group at one institution, as neither the data nor the tools of any of these findings have been made available for the use of external researchers. Several attempts to create a stochastic PoS tagger for Sinhala have been made, with the studies by Herath and Weerasinghe [88], Jayaweera and Dias [89], and Jayasuriya and Weerasinghe [90] being the most notable. Within a single group which did one of the above stochastic studies [89], yet another set of studies was carried out to create a Sinhala PoS tagger, starting with the foundation of Jayaweera and Dias [91], which was then extended to a Hidden Markov Model (HMM) based approach [92] and an analysis of unknown words [93, 94]. Further, this group presented a comparison of a few Sinhala PoS taggers that were available to them [95]. A RESTful PoS tagging web service created by Jayaweera and Dias [96] using the above research can still be accessed (http://bit.ly/2F0jKid) via POST and GET. A hybrid PoS tagger for the Sinhala language was proposed by Gunasekara et al. [97].
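For readers unfamiliar with HMM based tagging, the sketch below shows Viterbi decoding over a first-order HMM. It is a generic textbook formulation and not the tagger of [92]; the two-tag model, its probabilities, the tokens `w1` to `w3`, and the small constant used for unknown words are all invented for illustration.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for a non-empty token list."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = (log-probability of the best path ending in tag t, previous tag)
    best = [{t: (log(start_p[t]) + log(emit_p[t].get(tokens[0], unk)), None)
             for t in tags}]
    for token in tokens[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + log(trans_p[p][t]), p) for p in tags)
            column[t] = (score + log(emit_p[t].get(token, unk)), prev)
        best.append(column)

    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Toy model: every number below is a made-up placeholder.
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.8, "VERB": 0.2}
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.6, "VERB": 0.4}}
EMIT = {"NOUN": {"w1": 0.6, "w2": 0.1}, "VERB": {"w1": 0.1, "w2": 0.7}}

tagged = viterbi(["w1", "w2"], TAGS, START, TRANS, EMIT)
```

The flat `unk` emission probability is the crudest possible treatment of unknown words; the analyses of unknown words cited above [93, 94] address exactly the part this sketch glosses over.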

2.7 Parsers

The PoS tagged data then needs to be handed over to a parser. This is an area which is not completely solved even in English due to various inherent ambiguities in natural languages. However, in the case of English, there are systems which provide adequate results [98], even if not perfect yet. A parser for the Sinhala language has been proposed by Hettige and Karunananda [29] with a model for grammar [39]. The study by Liyanage et al. [78] is concentrated on this layer, given that they have worked on formalizing a computational grammar for Sinhala. A prosodic phrasing model for the Sinhala language has been implemented by Bandara et al. [99]. While they do report reasonable results, they, yet again, do not provide any means for the public to access the data or the tools that they have developed. Kanduboda and Prabath [79] have worked on Sinhala differential object markers relevant for parsing.

2.8 Named Entity Recognition Tools

As shown in Fig 1, once the text is properly parsed, it has to be processed using a Named Entity Recognition (NER) system. An NER system for Sinhala named Ananya has been developed by Manamini et al. [60] and is available to download at GitHub (http://bit.ly/2XrwCoK). Another independent attempt at Sinhala NER has been made by Dahanayaka and Weerasinghe [100], but its data and code are not accessible to the public. A conditional random fields-based NER system was proposed by Senevirathne et al. [101].

2.9 Semantic Tools

Applications of the semantic layer are more advanced than the ones below it in Figure 1. But even with the obvious lack of resources and tools, a number of attempts have been made at semantic level applications for the Sinhala language. The earliest attempt at semantic analysis was made by Herath et al. [102] using their earlier work which dealt with Sinhala morphological analysis [80]. A Sinhala semantic similarity measure has been developed for short sentences by Kadupitiya et al. [103]. This work was then extended by Kadupitiya et al. [104] for the application use case of short answer grading. Data and tools for these projects are not publicly available. A deterministic process flow for automatic Sinhala text summarizing was proposed by Welgama [105].

There have been multiple attempts to do word sense disambiguation (WSD) for Sinhala. For this, Arukgoda et al. [106] have proposed a system based on the Lesk algorithm [107] while Marasinghe et al. [108] have proposed a system based on probabilistic modeling. A dialogue act recognition system which utilizes simple classification algorithms has been proposed by Palihakkara et al. [109].
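As background for the Lesk based approach, the following sketch shows the simplified Lesk idea: choose the sense whose dictionary gloss shares the most tokens with the context of the ambiguous word. It is not the system of Arukgoda et al. [106]. The English word, glosses, and sense labels are placeholders, since a Sinhala version would need a sense inventory such as the WordNets discussed earlier.

```python
def simplified_lesk(word, context_tokens, sense_glosses):
    """Pick the sense whose gloss shares the most tokens with the context."""
    context = set(context_tokens) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical sense inventory for illustration only.
GLOSSES = {
    "bank_river": "sloping land beside a river or lake",
    "bank_finance": "institution that accepts money deposits and gives loans",
}
sense = simplified_lesk(
    "bank", "he sat on the land beside the river bank".split(), GLOSSES
)
```

In practice the raw token overlap would be computed after stop word removal and lemmatization, both of which are themselves open problems for Sinhala as the surrounding sections show.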

Text classification is a popular application on the semantic layer of the NLP stack. A very basic Sinhala text classification using Naïve Bayes Classifier, Zipf's Law Behavior, and SVMs was attempted by Gallege [110]. A smaller implementation of Sinhala news classification has been attempted by de Silva [13]. As mentioned above, their news corpus is publicly available (https://osf.io/tdb84/). A word2vec based tool (http://bit.ly/2QKI9Np) for sentiment analysis of Sinhala news comments is available. Another attempt at Sinhala text classification using six popular rule based algorithms was made by Lakmali and Haddela [111]. Even though they talk about building a corpus named SinNG5, they do not indicate any means for others to obtain the said corpus. Nanayakkara and Ranathunga [112] have implemented a system which uses corpus-based similarity measures for Sinhala text classification. Gunasekara and Haddela [113] claim to have created a context aware stop word extraction method for Sinhala text classification based on simple TF-IDF.
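To illustrate why TF-IDF is a natural signal for stop word extraction, the sketch below computes plain TF-IDF weights and flags terms whose weight never rises above a threshold in any document. This is only a generic illustration and not the context aware method of Gunasekara and Haddela [113]; the toy documents, the threshold, and the function names are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def stop_word_candidates(documents, threshold=0.0):
    """Terms whose best TF-IDF weight in any document is at or below the threshold."""
    best = {}
    for doc_weights in tf_idf(documents):
        for term, weight in doc_weights.items():
            best[term] = max(best.get(term, 0.0), weight)
    return sorted(term for term, weight in best.items() if weight <= threshold)


# Placeholder documents; "the" plays the role of a function word.
DOCS = [["the", "cat", "the"], ["the", "dog"], ["the", "bird"]]
candidates = stop_word_candidates(DOCS)
```

A term that appears in every document has an inverse document frequency of zero, so its weight is zero regardless of how often it occurs, which is the property a TF-IDF based stop word list exploits.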

2.10 Phonological Tools

In the case of the phonological layer, a report on Sinhala phonetics and phonology was published by Wasala and Gamage [114]. Based on the earlier work by Weerasinghe et al. [115], Wasala et al. [116] have developed methods for Sinhala grapheme-to-phoneme conversion along with a set of rules for schwa epenthesis. This work was then extended by Nadungodage et al. [117]. Weerasinghe et al. [118] developed a Sinhala text-to-speech system. However, it is not publicly accessible. They internally extended it to create a system capable of helping a mute person achieve synthesized real-time interactive voice communication in Sinhala [119]. A rule based approach for automatic segmentation of a small set of Sinhala text into syllables was proposed by Kumara et al. [120]. A new prosodic phrasing method to help with the Sinhala text-to-speech process was proposed by Bandara et al. [121, 122]. Sodimana et al. [123] proposed a text normalization methodology for Sinhala text-to-speech systems. Further, Sodimana et al. [124] formalized a step-by-step process for building text-to-speech voices for Sinhala. A separate group has done work on Sinhala text-to-speech systems independent of the above [125].

Conversely, Nadungodage et al. [126] have done a series of work on Sinhala speech recognition with special notice given to Sinhala being a resource poor language. This project divides its focus among continuity [127], active learning [128], and speaker adaptation [129]. A speaker independent Sinhala speech recognition system for voice dialing was proposed by Amarasingha and Gamini [130] and, on the other end, a Sinhala speech recognition methodology for interactive voice response systems which are accessed through mobile phones was proposed by Manamperi et al. [131]. Priyadarshani [132] proposes a method for speaker dependent speech recognition based on their previous work on dynamic time warping for recognizing isolated Sinhala words [133], genetic algorithms [134], and a syllable segmentation method utilizing acoustic envelopes [135]. The method proposed by Gunasekara and Meegama [136] utilizes an HMM model for Sinhala speech-to-text. A Sinhala speech recognizer supporting bi-directional conversion between Unicode Sinhala and phonetic English was proposed by Punchimudiyanse and Meegama [137]. The Sinhala speech classification system proposed by [138] classifies speech without first converting it to text. However, they report that this approach only works for specific domains with well-defined limited vocabularies.
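Dynamic time warping, mentioned above for isolated word recognition, aligns two sequences that may differ in speed before comparing them. The sketch below is the classic formulation on one-dimensional sequences and is not the implementation of [133]; real systems would compare frames of acoustic feature vectors, and the templates and labels here are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_template(sample, templates):
    """Return the label of the stored template with the smallest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))


# Placeholder templates standing in for per-word feature tracks.
TEMPLATES = {"word_a": [1, 2, 3], "word_b": [5, 5, 5]}
label = nearest_template([1, 2, 2, 3], TEMPLATES)
```

Template matching of this kind is inherently speaker dependent, because the stored templates come from particular recordings, which fits the speaker dependent framing of the work cited above.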

2.11 Optical Character Recognition Applications

While it is not necessarily a component of the NLP stack shown in Fig 1, which follows the definition by Liddy [15], it is possible to swap out the bottom most phonological layer of the stack in favour of an Optical Character Recognition (OCR) and text rendering layer.

An attempt at a Sinhala OCR system was made by Rajapakse et al. [139] before any other work had been done on the topic. Much later, a linear symmetry based approach was proposed by Premaratne and Bigun [140, 141]. They then used hidden Markov model-based optimization on the recognized Sinhala script [142]. Similarly, Hewavitharana et al. [143] used hidden Markov models for off-line Sinhala character recognition. Statistical approaches with histogram projections for Sinhala character recognition were proposed by Hewavitharana and Kodikara [144], by Ajward et al. [145], and by Madushanka et al. [146]. Karunanayaka et al. [147] also did off-line Sinhala character recognition with a use case for postal city name recognition. A separate group attempted Sinhala OCR [148] mainly involving the nearest-neighbor method [149]. A study by Ediriweera [150] uses dictionaries to correct errors in Sinhala OCR. An early attempt at Sinhala OCR by Dias et al. [151] has been extended to be online and made available for use via desktops [152] and hand-held devices [153] with the ability to recognize handwriting. A simple neural network based approach for Sinhala OCR was utilized by Rimas et al. [154]. A fuzzy-based model for identifying printed Sinhala characters was proposed by Gunarathna et al. [155]. Premachandra et al. [156] propose a simple back-propagation artificial neural network with hand crafted features for Sinhala character recognition. Another neural network with specialized feature extraction for Sinhala character recognition was proposed by Jayamaha and Naleer [157]. On the matter of neural networks and feature extraction, a feature selection process for Sinhala OCR was proposed by Kumara and Ragel [158]. Jayawickrama et al. [159] worked on Sinhala printed characters with special focus on handling diacritic vowels. However, they opted to refer to diacritic vowels as modifiers in their work. Gunawardhana and Ranathunga [160] proposed a limited approach to recognize Sinhala letters on Facebook images.
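The histogram projection idea mentioned above can be sketched in a few lines: sum the ink pixels along each row and column of a binarized image, then cut wherever the profile drops to zero. This is a generic illustration rather than the method of [144, 145, 146]; the tiny bitmap and the function names are hypothetical.

```python
def projection_profiles(bitmap):
    """Return (row_sums, column_sums) for a binary image given as a list of rows."""
    rows = [sum(row) for row in bitmap]
    cols = [sum(col) for col in zip(*bitmap)]
    return rows, cols


def split_on_gaps(profile):
    """Return (start, end) index pairs of runs with ink, split at empty positions."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


# 1 marks ink, 0 marks background; two glyphs separated by an empty column.
BITMAP = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
]
row_profile, col_profile = projection_profiles(BITMAP)
glyph_columns = split_on_gaps(col_profile)
```

Cutting on empty columns breaks down when characters touch or when diacritic vowels sit above or below a base character, which is why several of the studies in this section give those cases special attention.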

Fernando et al. [161] claim to have created a database for handwriting recognition research in the Sinhala language and further claim that the data set is available at the National Science Foundation (NSF) of Sri Lanka. However, the paper provides no URLs and we were not able to find the data set on the NSF website either. The work by Karunanayaka et al. [162] is focused on noise reduction and skew correction of Sinhala handwritten words. A genetic algorithm-based approach for non-cursive Sinhala handwritten script recognition was proposed by Jayasekara and Udawatta [163]. Nilaweera et al. [164] compare projection and wavelet-based techniques for recognizing handwritten Sinhala script. Silva and Kariyawasam [165] worked on segmenting Sinhala handwritten characters with special focus on handling diacritic vowels. A comparative study of a few available Sinhala handwriting recognition methods was done by Silva et al. [166]. Silva et al. [167] use contour tracing for isolated characters in handwritten Sinhala text. A Sinhala handwriting OCR system which utilizes zone-based feature extraction has been proposed by Dharmapala et al. [168]. The study by Walawage and Ranathunga [169] specifically focuses on segmentation of overlapping and touching Sinhala handwritten characters.

Summarization of optically recognized old Sinhala text for the purpose of archival search and preservation was explored by Rathnasena et al. [170]. The work of Peiris [171] also focused on OCR for ancient Sinhala inscriptions. A neural network based method for recognizing ancient Sinhala inscriptions was proposed by Karunarathne et al. [172]. Chanda et al. [173] proposed a Gaussian kernel SVM based method for word-wise Sinhala, Tamil, and English script identification.

2.12 Translators

A series of work has been done by one group towards English to Sinhala translation, as mentioned in some of the above subsections. This work includes building a morphological analyzer [28], lexicon databases [31], a transliteration system [32], an evaluation model [37], a computational model of grammar [39], and a multi-agent solution [44]. After working on human-assisted machine translation [33], Hettige and Karunananda [36, 38] have attempted to establish a theoretical basis for English to Sinhala machine translation. A very simplistic web based translator was proposed [34, 35]. The same group has worked on a Sinhala ontology generator for the purpose of machine translation [43] and a phrase level translator [45] based on the previous work on a multi-agent system for translation [42]. Further, an application of the English to Sinhala translator to the use case of selected text for reading was implemented [40].

Another group independently attempted English-to-Sinhala machine translation [174] with a statistical approach [175]. Wijerathna et al. [176] and De Silva et al. [177] have proposed simple rule based translators. An example-based method applied on the English-Sinhala sentence aligned government domain corpus was proposed by Silva and Weerasinghe [178]. A translator based on a look-up system was proposed by Vidanaralage et al. [179].

Most of the cross Sinhala and Tamil work has been done in the domain of machine translation. A neural machine translation effort for the Sinhala and Tamil languages was initiated by Tennage et al. [180, 181]. They then further enhanced it with transliteration and byte pair encoding [182] and used synthetic training data to handle the rare word problem [183]. This project produced Si-Ta [184], a machine translation system for Sinhala and Tamil official documents. On the statistical machine translation front, Farhath et al. [185] worked on integrating bilingual lists. The attempts by Weerasinghe [186] and Sripirakas et al. [187] were also focused on statistical machine translation, while Jeyakaran [188] attempted a kernel regression method. Yet another attempt was made by Pushpananda et al. [189], which they later extended with some quality improvements [190]. An attempt at real-time direct translation between Sinhala and Tamil was made by Rajpirathap et al. [191]. While not related to Tamil, there have been attempts to link Sinhala NLP with Japanese by Herath et al. [46, 47], Thelijjagoda [192], and Kanduboda [48]. There has been an attempt to use dictionary-based machine translation [193] between Sinhala and the liturgical language of Buddhism, Pali.
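Byte pair encoding, mentioned above as an enhancement to the neural translation system, builds a subword vocabulary by repeatedly merging the most frequent adjacent pair of symbols. The following is a minimal sketch of that merge-learning loop and not the configuration used in [182]. It omits the end-of-word marker that common implementations add, splits words by Unicode code point (which separates Sinhala base letters from their combining vowel signs), and uses placeholder Latin words.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent symbol pair."""
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # Break frequency ties by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_vocab[key] = merged_vocab.get(key, 0) + count
        vocab = merged_vocab
    return merges, vocab


merges, vocab = learn_bpe({"aab": 3, "aac": 2}, num_merges=2)
```

Because rare words decompose into frequent subword units, this kind of segmentation is one common way to reduce the rare word problem that the same project also tackled with synthetic training data.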

2.13 Miscellaneous Applications

In this section, we discuss NLP tools and research which are either hard to categorize under the above sections or involve several of them equally. The first miscellaneous application of Sinhala NLP is spell checking. The open-source, data-driven approach proposed by Wasala et al. [194, 195] claims to be able to check and correct spelling errors in Sinhala. The approach by Jayalatharachchi et al. [196] attempts to obtain synergy between two algorithms for the same purpose. These efforts [194, 196] were then extended by Subhagya et al. [197].

On the matter of Sinhala sign language, strides have been made in the domains of computer interpreting for written Sinhala [198] and animation of finger-spelled words and number signs [199]. A simple Sinhala chat bot which utilizes a small knowledge base has been proposed by Hettige and Karunananda [30]. Fernando [200] proposed a method for inexact matching of Sinhala proper names.
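The survey does not describe how the inexact matching of proper names [200] works internally. Purely as a generic illustration of what inexact matching can mean, the sketch below ranks candidate names by Levenshtein edit distance. This is a swapped-in textbook technique, not the cited method, and the function names and the max_distance threshold are hypothetical choices.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]


def closest_names(query, names, max_distance=2):
    """Return names within `max_distance` edits of `query`, nearest first."""
    scored = sorted((edit_distance(query, n), n) for n in names)
    return [n for d, n in scored if d <= max_distance]


if __name__ == "__main__":
    # Romanized names are used here only to keep the example readable.
    print(closest_names("Nimal", ["Nimali", "Kamal", "Sunil"]))
```

The same distance function could also serve as the candidate-ranking step of a simple spell checker, although the spell checkers cited above may rely on quite different techniques.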

3 Conclusion

At this point, a reader might think that there seems to be a significant number of implementations of NLP for Sinhala. How, then, can one justify listing Sinhala as a resource-poor language? The important point missing from that assumption is that, for almost all of the implementations and findings listed above, the only thing publicly available to a researcher is a set of research papers. The corpora, tools, algorithms, and anything else discovered through this research are either locked away as the property of individual research groups or, worse, lost to time with crashed ancient servers, lost hard drives, and expired web hosts. This, and probably academic/research rivalry, has caused these separate research groups not to cite or build upon each other's work. In many cases where similar work is done, it is a rehashing of the same ideas adopted from resource-rich languages, because work done by another group is unavailable or because of a reluctance to refer to and build on it. This has resulted in multiple groups building multiple foundations behind closed doors, with no one ending up with a completed end-to-end NLP workflow. In conclusion, even though there are islands of implementations done for Sinhala NLP, they are of very small scale and/or are usually not readily accessible for further use and research by other researchers. Thus, so far, sadly, Sinhala remains a resource-poor language.

References

  • Bauer [2007] L. Bauer, Linguistics Student’s Handbook.    Edinburgh University Press, 2007.
  • [2] Department of Census and Statistics Sri Lanka. Percentage of population aged 10 years and over in major ethnic groups by district and ability to speak sinhala, tamil and english languages. [Online]. Available: https://goo.gl/nnVZSd
  • [3] H. Young. A language family tree - in pictures — education — the guardian. [Online]. Available: https://www.theguardian.com/education/gallery/2015/jan/23/a-language-family-tree-in-pictures
  • Bandara et al. [2012] D. Bandara, N. Warnajith, A. Minato, and S. Ozawa, “Creation of precise alphabet fonts of early brahmi script from photographic data of ancient sri lankan inscriptions,” Canadian Journal on Artificial Intelligence, Machine Learning and Pattern Recognition, vol. 3, no. 3, pp. 33–39, 2012.
  • Daniels and Bright [1996] P. T. Daniels and W. Bright, The world’s writing systems.    Oxford University Press on Demand, 1996.
  • Sirisoma [1990] M. H. Sirisoma, “Brahmi inscriptions of sri lanka from 3rd century bc to 65 ad,” pp. 3–54, 1990.
  • Dias [1996] M. Dias, “Lakdiwa sellipiwalin heliwana sinhala bhashawe prathyartha namayange vikashanaya,” Department of Archaeology, Colombo Sri Lanka, p. 1, 1996.
  • Hettiarachchi [1990] A. S. Hettiarachchi, “Investigation of 2nd, 3rd and 4th century inscriptions,” Inscriptions: Volume Two, Archaeological Department Centenary (1890–1990), Commemorative Series. Colombo: Department of Archaeology, pp. 57–104, 1990.
  • Paranavitana and Depārtamēntuva [1970] S. Paranavitana and S. L. P. Depārtamēntuva, Inscriptions of Ceylon.    Dept. of Archaeology, 1970.
  • Salomon [1998] R. Salomon, Indian epigraphy: a guide to the study of inscriptions in Sanskrit, Prakrit, and the other Indo-Aryan languages.    Oxford University Press, 1998.
  • Falk [1993] H. Falk, Schrift im alten Indien: ein Forschungsbericht mit Anmerkungen.    Gunter Narr Verlag, 1993, vol. 56.
  • Ray [2003] H. P. Ray, The archaeology of seafaring in ancient South Asia.    Cambridge University Press, 2003.
  • de Silva [2015] N. de Silva, “Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language,” 2015.
  • Wijeratne et al. [2019] Y. Wijeratne, N. de Silva, and Y. Shanmugarajah, “Natural Language Processing for Government: Problems and Potential,” LIRNEasia, 2019.
  • Liddy [2001] E. D. Liddy, “Natural language processing,” 2001.
  • Consortium et al. [1996] U. Consortium et al., “The unicode standard: A technical introduction,” online document, http://www.unicode.org/unicode/standards/principles.html, 1996.
  • Wimalasuriya and Dou [2010] D. C. Wimalasuriya and D. Dou, “Ontology-based information extraction: An introduction and a survey of current approaches,” Journal of Information Science, vol. 36, no. 3, pp. 306–323, 2010.
  • Preston and Bishop [2002] J. Preston and M. J. M. Bishop, Views into the Chinese room: New essays on Searle and artificial intelligence.    OUP, 2002.
  • Samaranayake et al. [1989] V. K. Samaranayake, J. B. Disanayaka, and S. T. Nandasara, “A standard code for sinhala characters,” Proceedings, 9th Annual Sessions of the Computer Society of Sri Lanka, Colombo, 1989.
  • Samaranayake et al. [2003] V. K. Samaranayake, S. T. Nandasara, J. B. Disanayaka, A. R. Weerasinghe, and H. Wijayawardhana, “An introduction to unicode for sinhala characters,” University Of Colombo School of Computing, 2003.
  • Dias and Goonetilleke [2004] G. Dias and A. Goonetilleke, “Development of standards for Sinhala computing,” in 1st Regional Conference on ICT and E-Paradigms, 2004.
  • Dias [2005] G. V. Dias, “Challenges of enabling it in the sinhala language,” in 27th Internationalization and Unicode Conference, 2005.
  • Weerasinghe et al. [2006a] A. R. Weerasinghe, D. L. Herath, and K. Gamage, “The sinhala collation sequence and its representation in unicode,” Localization Focus, 2006.
  • Sandeva [2009] G. Sandeva, “Design and evaluation of user-friendly yet efficient sinhala input methods,” 2009.
  • Herath et al. [1991] S. Herath, S. Ishizaki, T. Ikeda, Y. Anzai, and H. Aiso, “Machine processing of sinhala natural language: a step toward intelligent systems,” Cybernetics and systems, vol. 22, no. 3, pp. 331–348, 1991.
  • Nandasara [2009] S. T. Nandasara, “From the past to the present: Evolution of computing in the sinhala language,” IEEE Annals of the History of Computing, vol. 31, no. 1, pp. 32–45, 2009.
  • Nandasara and Mikami [2016] S. T. Nandasara and Y. Mikami, “Bridging the digital divide in sri lanka: some challenges and opportunities in using sinhala in ict,” International Journal on Advances in ICT for Emerging Regions (ICTer), vol. 8, no. 1, 2016.
  • Hettige and Karunananda [2006a] B. Hettige and A. S. Karunananda, “A morphological analyzer to enable english to sinhala machine translation,” in Information and Automation, 2006. ICIA 2006. International Conference on.    IEEE, 2006, pp. 21–26.
  • Hettige and Karunananda [2006b] ——, “A parser for sinhala language-first step towards english to sinhala machine translation,” in Industrial and Information Systems, First International Conference on.    IEEE, 2006, pp. 583–587.
  • Hettige and Karunananda [2006c] ——, “First sinhala chatbot in action,” Proceedings of the 3rd Annual Sessions of Sri Lanka Association for Artificial Intelligence (SLAAI), University of Moratuwa, 2006.
  • Hettige and Karunananda [2007a] ——, “Developing lexicon databases for english to sinhala machine translation,” in Industrial and Information Systems, 2007. ICIIS 2007. International Conference on.    IEEE, 2007, pp. 215–220.
  • Hettige and Karunananda [2007b] ——, “Transliteration system for english to sinhala machine translation,” in Industrial and Information Systems, 2007. ICIIS 2007. International Conference on.    IEEE, 2007, pp. 209–214.
  • Hettige and Karunananda [2007c] ——, “Using human-assisted machine translation to overcome language barrier in sri lanka,” Proceedings of 4th Annual session of Sri Lanka Association for Artificial Intelligence, p. 10, 2007.
  • Hettige and Karunananda [2008a] ——, “Web-based english-sinhala translator in action,” in 2008 4th International Conference on Information and Automation for Sustainability.    IEEE, 2008, pp. 80–85.
  • Hettige and Karunananda [2008b] ——, “Web-based english to sinhala selected texts translation system,” Sri Lanka Association for Artificial Intelligence, p. 26, 2008.
  • Hettige and Karunananda [2009] ——, “Theoretical based approach to english to sinhala machine translation,” in 2009 International Conference on Industrial and Information Systems (ICIIS).    IEEE, 2009, pp. 380–385.
  • Hettige and Karunananda [2010a] ——, “An evaluation methodology for english to sinhala machine translation,” in Information and Automation for Sustainability (ICIAFs), 2010 5th International Conference on.    IEEE, 2010, pp. 31–36.
  • Hettige and Karunananda [2010b] ——, “Varanageema: A theoretical basics for english to sinhala machine translation,” in Sri Lanka Association for Artificial Intelligence (SLAAI), 2010.
  • Hettige and Karunananda [2011] ——, “Computational model of grammar for english to sinhala machine translation,” in Advances in ICT for Emerging Regions (ICTer), 2011 International Conference on.    IEEE, 2011, pp. 26–31.
  • Hettige et al. [2013a] B. Hettige, G. Rzevski, and A. S. Karunananda, “Selected text machine translator for english to sinhala,” 2013.
  • Hettige et al. [2012] B. Hettige, A. S. Karunananda, and G. Rzevski, “Multi-agent system technology for morphological analysis,” Proceedings of the 9th Annual Sessions of Sri Lanka Association for Artificial Intelligence (SLAAI), Colombo, 2012.
  • Hettige et al. [2013b] ——, “Masmt: A multi-agent system development framework for english-sinhala machine translation,” International Journal of Computational Linguistics and Natural Language Processing (IJCLNLP), vol. 2, no. 7, pp. 411–416, 2013.
  • Hettige et al. [2014] ——, “Sinhala ontology generator for english to sinhala machine translation,” in Proc. of KDU International Research Conference, 2014.
  • Hettige et al. [2016] ——, “A multi-agent solution for managing complexity in english to sinhala machine translation,” Complex Systems: Fundamentals & Applications, vol. 90, p. 251, 2016.
  • Hettige et al. [2017] ——, “Phrase-level english to sinhala machine translation with multi-agent approach,” in 2017 IEEE International Conference on Industrial and Information Systems (ICIIS).    IEEE, 2017, pp. 1–6.
  • Herath et al. [1994] A. Herath, Y. Hyodo, Y. Kawada, T. Ikeda, and S. Herath, “A practical machine translation system from japanese to modern sinhalese,” Gifu University, pp. 153–162, 1994.
  • Herath et al. [1996] A. Herath, Y. Hyodo, Y. Kunieda, T. Ikeda, and S. Herath, “Bunsetsu-based japanese-sinhalese translation system,” Information sciences, vol. 90, no. 1-4, pp. 303–319, 1996.
  • Kanduboda [2011] A. B. Kanduboda, “The role of animacy in determining noun phrase cases in the sinhalese and japanese languages,” Science of words, vol. 24, pp. 5–20, 2011.
  • Upeksha et al. [2015a] D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. H. N. D. de Silva, and G. Dias, “Implementing a Corpus for Sinhala Language,” in Symposium on Language Technology for South Asia 2015, 2015.
  • Upeksha et al. [2015b] ——, “Comparison between performance of various database systems for implementing a language corpus,” in International Conference: Beyond Databases, Architectures and Structures.    Springer, May 2015, pp. 82–91.
  • Weerasinghe et al. [2009] R. Weerasinghe, D. Herath, and V. Welgama, “Corpus-based sinhala lexicon,” in Proceedings of the 7th Workshop on Asian Language Resources.    Association for Computational Linguistics, 2009, pp. 17–23.
  • Guzmán et al. [2019] F. Guzmán, P.-J. Chen, M. Ott, J. Pino, G. Lample, P. Koehn, V. Chaudhary, and M. Ranzato, “Two new evaluation datasets for low-resource machine translation: Nepali-english and sinhala-english,” arXiv preprint arXiv:1902.01382, 2019.
  • Lison and Tiedemann [2016] P. Lison and J. Tiedemann, “Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles,” 2016.
  • Hameed et al. [2016] R. A. Hameed, N. Pathirennehelage, A. Ihalapathirana, M. Z. Mohamed, S. Ranathunga, S. Jayasena, G. Dias, and S. Fernando, “Automatic creation of a sentence aligned sinhala-tamil parallel corpus,” in Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), 2016, pp. 124–132.
  • Mohamed et al. [2017] M. Z. Mohamed, A. Ihalapathirana, R. A. Hameed, N. Pathirennehelage, S. Ranathunga, S. Jayasena, and G. Dias, “Automatic creation of a word aligned sinhala-tamil parallel corpus,” in Engineering Research Conference (MERCon), 2017 Moratuwa.    IEEE, 2017, pp. 425–430.
  • Farhath et al. [2018a] F. Farhath, P. Theivendiram, S. Ranathunga, S. Jayasena, and G. Dias, “Improving domain-specific smt for low-resourced languages using data from different domains,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
  • Fernando et al. [2016] S. Fernando, S. Ranathunga, S. Jayasena, and G. Dias, “Comprehensive part-of-speech tag set and svm based pos tagger for sinhala,” in Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), 2016, pp. 173–182.
  • Dilshani et al. [2017] N. Dilshani, S. Fernando, S. Ranathunga, S. Jayasena, and G. Dias, “A comprehensive part of speech (pos) tag set for sinhala language.”    The Third International Conference on Linguistics in Sri Lanka, ICLSL 2017 …, 2017.
  • Fernando and Ranathunga [2018] S. Fernando and S. Ranathunga, “Evaluation of different classifiers for sinhala pos tagging,” in 2018 Moratuwa Engineering Research Conference (MERCon).    IEEE, 2018, pp. 96–101.
  • Manamini et al. [2016] S. A. P. M. Manamini, A. F. Ahamed, R. A. E. C. Rajapakshe, G. H. A. Reemal, S. Jayasena, G. V. Dias, and S. Ranathunga, “Ananya-a named-entity-recognition (ner) system for sinhala language,” in Moratuwa Engineering Research Conference (MERCon), 2016.    IEEE, 2016, pp. 30–35.
  • Joulin et al. [2016] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov, “Fasttext. zip: Compressing text classification models,” arXiv preprint arXiv:1612.03651, 2016.
  • Bojanowski et al. [2017] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
  • Joulin et al. [2017] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 427–431.
  • [64] D. Herath, K. Gamage, and A. Malalasekara, “Research report on sinhala lexicon,” Language Technology Research Laboratory, UCSC.
  • Malalasekera [1967] G. P. Malalasekera, “English-sinhalese dictionary.” 1967.
  • [66] M. Kulatunga. Madura english-sinhala dictionary - online language translator. [Online]. Available: https://maduraonline.com/
  • Wasala and Weerasinghe [2008] A. Wasala and R. Weerasinghe, “Ensitip: a tool to unlock the english web,” in 11th international conference on humans and computers, Nagaoka University of Technology, Japan, 2008, pp. 20–23.
  • [68] L. Samarawickrama and B. Hettige, “Requirements for an english-sinhala smart bilingual dictionary: A review.”
  • [69] Department of Official Languages, Sri Lanka. Tri-lingual dictionary. [Online]. Available: https://www.trilingualdictionary.lk/
  • Weerasinghe and Dias [2013] A. Weerasinghe and G. Dias, “Construction of a multilingual place name database for sri lanka,” 2013.
  • Miller [1995] G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
  • Wu and Palmer [1994] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,” in Proceedings of the 32nd annual meeting on Association for Computational Linguistics.    Association for Computational Linguistics, 1994, pp. 133–138.
  • Jiang and Conrath [1997] J. J. Jiang and D. W. Conrath, “Semantic similarity based on corpus statistics and lexical taxonomy,” in Proc of 10th International Conference on Research in Computational Linguistics, ROCLING’97.    Citeseer, 1997.
  • de Silva et al. [2017] N. de Silva, D. Dou, and J. Huang, “Discovering inconsistencies in pubmed abstracts through ontology-based information extraction,” in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics.    ACM, 2017, pp. 362–371.
  • Wijesiri et al. [2014] I. Wijesiri, M. Gallage, B. Gunathilaka, M. Lakjeewa, D. Wimalasuriya, G. Dias, R. Paranavithana, and N. de Silva, “Building a wordnet for Sinhala,” in Proceedings of the Seventh Global WordNet Conference, 2014, pp. 100–108.
  • [76] Sinhala wordnet. [Online]. Available: http://www.wordnet.lk/
  • Welgama et al. [2011] V. Welgama, D. L. Herath, C. Liyanage, N. Udalamatta, R. Weerasinghe, and T. Jayawardana, “Towards a sinhala wordnet,” in Proceedings of the Conference on Human Language Technology for Development, 2011.
  • Liyanage et al. [2012] C. Liyanage, R. Pushpananda, D. L. Herath, and R. Weerasinghe, “A computational grammar of Sinhala,” in International Conference on Intelligent Text Processing and Computational Linguistics.    Springer, 2012, pp. 188–200.
  • Kanduboda and Prabath [2013] A. Kanduboda and B. Prabath, “On the usage of sinhalese differential object markers object marker /wa/ vs. object marker /ta/,” Theory and Practice in Language Studies, vol. 3, no. 7, p. 1081, 2013.
  • Herath et al. [1989] S. Herath, T. Ikeda, S. Yokoyama, H. Isahara, and S. Ishizaki, “Sinhalese morphological analysis: a step towards machine processing of sinhalese,” in [Proceedings 1989] IEEE International Workshop on Tools for Artificial Intelligence.    IEEE, 1989, pp. 100–107.
  • Herath et al. [1992] S. Herath, T. Ikeda, S. Ishizaki, Y. Anzai, and H. Aiso, “Analysis system for sinhalese unit structure,” Journal of Experimental & Theoretical Artificial Intelligence, vol. 4, no. 1, pp. 29–48, 1992.
  • Welgama et al. [2013] V. Welgama, R. Weerasinghe, and M. Niranjan, “Evaluating a machine learning approach to sinhala morphological analysis,” in Proceedings of the 10th International Conference on Natural Language Processing, Noida, India, 2013.
  • Fernando and Weerasinghe [2013] N. Fernando and R. Weerasinghe, “A morphological parser for sinhala verbs,” in Proceedings of the International Conference on Advances in ICT for Emerging Regions, 2013.
  • Dilshani and Dias [2017] W. S. N. Dilshani and G. Dias, “A corpus-based morphological analysis of sinhala verbs.”    The Third International Conference on Linguistics in Sri Lanka, ICLSL 2017 …, 2017.
  • Nandathilaka et al. [2018] M. Nandathilaka, S. Ahangama, and G. T. Weerasuriya, “A rule-based lemmatizing approach for sinhala language,” in 2018 3rd International Conference on Information Technology Research (ICITR).    IEEE, 2018, pp. 1–5.
  • [86] S. Basnayake, H. Wijekoon, and T. K. Wijayasiriwardhane, “Plagiarism detection in sinhala language: A software approach.”
  • [87] W. Viraj, W. Ruvan, and M. Niranjan, “Defining the gold standard definitions for the morphology of sinhala words.”
  • Herath and Weerasinghe [2004] D. L. Herath and A. R. Weerasinghe, “A stochastic part of speech tagger for sinhala,” in Proceedings of the 06th International Information Technology Conference, 2004, pp. 27–28.
  • Jayaweera and Dias [2012] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Evaluation of stochastic based tagging approach for sinhala language,” 2012.
  • Jayasuriya and Weerasinghe [2013] M. Jayasuriya and A. R. Weerasinghe, “Learning a stochastic part of speech tagger for sinhala,” in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on.    IEEE, 2013, pp. 137–143.
  • Jayaweera and Dias [2011] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Part of speech (pos) tagger for sinhala language,” 2011.
  • Jayaweera and Dias [2014a] ——, “Hidden markov model based part of speech tagger for sinhala language,” arXiv preprint arXiv:1407.2989, 2014.
  • Jayaweera and Dias [2014b] ——, “Unknown words analysis in pos tagging of sinhala language,” in Advances in ICT for Emerging Regions (ICTer), 2014 International Conference on.    IEEE, 2014, pp. 270–270.
  • Jayaweera and Dias [2014c] ——, “Handling issues with unknown words in pos tagging.”    Book of Abstracts, Annual Research Symposium 2014, 2014.
  • Jayaweera and Dias [2016] M. Jayaweera and N. G. J. Dias, “Comparison of part of speech taggers for sinhala language,” 2016.
  • Jayaweera and Dias [2015] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Restful pos tagging web service for sinhala language,” in 2015 Fifteenth International Conference on Advances in ICT for Emerging Regions (ICTer).    IEEE, 2015, pp. 50–57.
  • Gunasekara et al. [2016] D. Gunasekara, W. V. Welgama, and A. R. Weerasinghe, “Hybrid part of speech tagger for sinhala language,” in Advances in ICT for Emerging Regions (ICTer), 2016 Sixteenth International Conference on.    IEEE, 2016, pp. 41–48.
  • Manning et al. [2014] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60. [Online]. Available: http://www.aclweb.org/anthology/P/P14/P14-5010
  • Bandara et al. [2013] W. M. C. Bandara, V. M. S. Lakmal, T. D. Liyanagama, S. V. Bulathsinghala, G. Dias, and S. Jayasena, “A new prosodic phrasing model for sinhala language,” 2013.
  • Dahanayaka and Weerasinghe [2014] J. K. Dahanayaka and A. R. Weerasinghe, “Named entity recognition for sinhala language,” in Advances in ICT for Emerging Regions (ICTer), 2014 International Conference on.    IEEE, 2014, pp. 215–220.
  • Senevirathne et al. [2015] K. U. Senevirathne, N. S. Attanayake, A. W. M. H. Dhananjanie, W. A. S. U. Weragoda, A. Nugaliyadde, and S. Thelijjagoda, “Conditional random fields based named entity recognition for sinhala,” in 2015 IEEE 10th International Conference on Industrial and Information Systems (ICIIS).    IEEE, 2015, pp. 302–307.
  • Herath et al. [1990] S. Herath, S. Ishizaki, T. Ikeda, Y. Anzai, and H. Aiso, “Syntactic and semantic analysis of sinhala: a step towards intelligence computing systems,” in Proceedings. 5th IEEE International Symposium on Intelligent Control 1990.    IEEE, 1990, pp. 316–324.
  • Kadupitiya et al. [2016] J. C. S. Kadupitiya, S. Ranathunga, and G. Dias, “Sinhala short sentence similarity calculation using corpus-based and knowledge-based similarity measures,” in Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), 2016, pp. 44–53.
  • Kadupitiya et al. [2017] ——, “Sinhala short sentence similarity measures using corpus-based similarity for short answer grading,” in 6th Workshop on South and Southeast Asian Natural Language Processing, 2017, pp. 44–53.
  • Welgama [2012] W. V. Welgama, “Automatic text summarization for sinhala,” 2012.
  • Arukgoda et al. [2014] J. Arukgoda, V. Bandara, S. Bashani, V. Gamage, and D. Wimalasuriya, “A word sense disambiguation technique for sinhala,” in 2014 4th International Conference on Artificial Intelligence with Applications in Engineering and Technology.    IEEE, 2014, pp. 207–211.
  • Lesk [1986] M. Lesk, “Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone,” in Proceedings of the 5th annual international conference on Systems documentation.    Citeseer, 1986, pp. 24–26.
  • Marasinghe et al. [2002] C. Marasinghe, S. Herath, and A. Herath, “Word sense disambiguation of sinhala language with unsupervised learning,” in Proc. International Conference on Information Technology and Applications, 2002, pp. 25–29.
  • Palihakkara et al. [2015] S. Palihakkara, D. Sahabandu, A. Shamsudeen, C. Bandara, and S. Ranathunga, “Dialogue act recognition for text-based sinhala,” in Proceedings of the 12th International Conference on Natural Language Processing, 2015, pp. 367–375.
  • Gallege [2010] S. Gallege, “Analysis of sinhala using natural language processing techniques,” 2010.
  • Lakmali and Haddela [2017] K. B. N. Lakmali and P. S. Haddela, “Effectiveness of rule-based classifiers in sinhala text categorization,” in 2017 National Information Technology Conference (NITC).    IEEE, 2017, pp. 153–158.
  • Nanayakkara and Ranathunga [2018] P. Nanayakkara and S. Ranathunga, “Clustering sinhala news articles using corpus-based similarity measures,” in 2018 Moratuwa Engineering Research Conference (MERCon).    IEEE, 2018, pp. 437–442.
  • Gunasekara and Haddela [2018] S. V. S. Gunasekara and P. S. Haddela, “Context aware stopwords for sinhala text classification,” in 2018 National Information Technology Conference (NITC).    IEEE, 2018, pp. 1–6.
  • Wasala and Gamage [2005] A. Wasala and K. Gamage, “Research report on phonetics and phonology of sinhala,” Language Technology Research Laboratory, University of Colombo School of Computing, vol. 35, 2005.
  • Weerasinghe et al. [2005] R. Weerasinghe, A. Wasala, and K. Gamage, “A rule based syllabification algorithm for sinhala,” in International Conference on Natural Language Processing.    Springer, 2005, pp. 438–449.
  • Wasala et al. [2006] A. Wasala, R. Weerasinghe, and K. Gamage, “Sinhala grapheme-to-phoneme conversion and rules for schwa epenthesis,” in Proceedings of the COLING/ACL on Main conference poster sessions.    Association for Computational Linguistics, 2006, pp. 890–897.
  • Nadungodage et al. [a] T. Nadungodage, C. Liyanage, A. Prerera, R. Pushpananda, and R. Weerasinghe, “Sinhala g2p conversion for speech processing,” in Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, pp. 112–116.
  • Weerasinghe et al. [2007] R. Weerasinghe, A. Wasala, V. Welgama, and K. Gamage, “Festival-si: A sinhala text-to-speech system,” in International Conference on Text, Speech and Dialogue.    Springer, 2007, pp. 472–479.
  • Amarasekara et al. [2013] M. S. Amarasekara, K. M. N. S. Bandara, B. V. A. I. Vithana, D. H. De Silva, and A. Jayakody, “Real-time interactive voice communication-for a mute person in sinhala (rtivc),” in 2013 8th International Conference on Computer Science & Education.    IEEE, 2013, pp. 671–675.
  • Kumara et al. [2007] K. H. Kumara, N. G. J. Dias, and H. Sirisena, “Automatic segmentation of given set of sinhala text into syllables for speech synthesis,” pp. 53–62, 2007.
  • Bandara et al. [2017] W. M. C. Bandara, W. M. S. Lakmal, T. D. Liyanagama, S. Bulathsinghala, G. Dias, and S. Jayasena, “A new prosodic phrasing method for sinhala language,” 2017.
  • Bandara et al. [2009] W. M. C. Bandara, S. V. Bulathsinghala, W. M. S. Lakmal, T. D. Liyanagama, G. Dias, and S. Jayasena, “Sinhala text to speech system,” 2009.
  • Sodimana et al. [2018a] K. Sodimana, P. De Silva, R. Sproat, A. Theeraphol, C. F. Li, A. Gutkin, S. Sarin, and K. Pipatsrisawat, “Text normalization for bangla, khmer, nepali, javanese, sinhala, and sundanese text-to-speech systems,” 2018.
  • Sodimana et al. [2018b] K. Sodimana, K. Pipatsrisawat, L. Ha, M. Jansche, O. Kjartansson, P. De Silva, and S. Sarin, “A step-by-step process for building tts voices using open source data and framework for bangla, javanese, khmer, nepali, sinhala, and sundanese,” 2018.
  • [125] L. Nanayakkara, C. Liyanage, P.-T. Viswakula, T. Nadungodage, R. Pushpananda, and R. Weerasinghe, “A human quality text to speech system for sinhala,” in Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, pp. 157–161.
  • Nadungodage et al. [b] T. Nadungodage, R. Weerasinghe, and M. Niranjan, “Speech recognition for low resourced languages: Efficient use of training data for sinhala speech recognition by active learning.”
  • Nadungodage and Weerasinghe [2011] T. Nadungodage and R. Weerasinghe, “Continuous sinhala speech recognizer,” in Conference on Human Language Technology for Development, Alexandria, Egypt, 2011, pp. 2–5.
  • Nadungodage et al. [2013] T. Nadungodage, R. Weerasinghe, and M. Niranjan, “Efficient use of training data for Sinhala speech recognition using active learning,” in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on.    IEEE, 2013, pp. 149–153.
  • Nadungodage et al. [2015] ——, “Speaker Adaptation Applied to Sinhala Speech Recognition.” Int. J. Comput. Linguistics Appl., vol. 6, no. 1, pp. 117–129, 2015.
  • Amarasingha and Gamini [2012] W. G. T. N. Amarasingha and D. D. A. Gamini, “Speaker independent sinhala speech recognition for voice dialling,” in International Conference on Advances in ICT for Emerging Regions (ICTer2012).    IEEE, 2012, pp. 3–6.
  • Manamperi et al. [2018] W. Manamperi, D. Karunathilake, T. Madhushani, N. Galagedara, and D. Dias, “Sinhala speech recognition for interactive voice response systems accessed through mobile phones,” in 2018 Moratuwa Engineering Research Conference (MERCon).    IEEE, 2018, pp. 241–246.
  • Priyadarshani [2012] P. G. N. Priyadarshani, “Speaker dependent speech recognition on a selected set of sinhala words,” 2012.
  • Priyadarshani et al. [2012a] P. G. N. Priyadarshani, N. G. J. Dias, and A. Punchihewa, “Dynamic time warping based speech recognition for isolated sinhala words,” in 2012 IEEE 55th International Midwest Symposium on Circuits and Systems (MWSCAS).    IEEE, 2012, pp. 892–895.
  • Priyadarshani et al. [2012b] ——, “Genetic algorithm approach for sinhala speech recognition,” in 2012 IEEE 55th International Midwest Symposium on Circuits and Systems (MWSCAS).    IEEE, 2012, pp. 896–899.
  • Priyadarshani and Dias [2011] P. G. N. Priyadarshani and N. G. J. Dias, “Automatic segmentation of separately pronounced sinhala words into syllables,” 2011.
  • Gunasekara and Meegama [2015] M. K. H. Gunasekara and R. G. N. Meegama, “Real-time translation of discrete sinhala speech to unicode text,” in 2015 Fifteenth International Conference on Advances in ICT for Emerging Regions (ICTer).    IEEE, 2015, pp. 140–145.
  • Punchimudiyanse and Meegama [2015] M. Punchimudiyanse and R. G. N. Meegama, “Unicode sinhala and phonetic english bi-directional conversion for sinhala speech recognizer,” in 2015 IEEE 10th International Conference on Industrial and Information Systems (ICIIS).    IEEE, 2015, pp. 296–301.
  • Buddhika et al. [2018] D. Buddhika, R. Liyadipita, S. Nadeeshan, H. Witharana, S. Jayasena, and U. Thayasivam, “Domain specific intent classification of sinhala speech data,” in 2018 International Conference on Asian Language Processing (IALP).    IEEE, 2018, pp. 197–202.
  • Rajapakse et al. [1995] R. K. Rajapakse, A. R. Weerasinghe, and E. K. Seneviratne, “A neural network based character recognition system for sinhala script,” Department of Statistics and Computer Science, University of Colombo, 1995.
  • Premaratne and Bigun [2002] H. L. Premaratne and J. Bigun, “Recognition of printed sinhala characters using linear symmetry,” in The 5th Asian Conference on Computer Vision, 2002, pp. 23–25.
  • Premaratne and Bigun [2004] ——, “A segmentation-free approach to recognise printed sinhala script using linear symmetry,” Pattern recognition, vol. 37, no. 10, pp. 2081–2089, 2004.
  • Premaratne et al. [2006] H. L. Premaratne, E. Järpe, and J. Bigun, “Lexicon and hidden markov model-based optimisation of the recognised sinhala script,” Pattern recognition letters, vol. 27, no. 6, pp. 696–705, 2006.
  • Hewavitharana et al. [2002] S. Hewavitharana, H. C. Fernando, and N. D. Kodikara, “Off-line sinhala handwriting recognition using hidden markov models.” in ICVGIP, 2002.
  • Hewavitharana and Kodikara [2002] S. Hewavitharana and N. D. Kodikara, “A statistical approach to sinhala handwriting recognition,” in Proc. of the International Information Technology Conference (IITC), Colombo, Sri Lanka, 2002.
  • Ajward et al. [2010] S. Ajward, N. Jayasundara, S. Madushika, and R. Ragel, “Converting printed sinhala documents to formatted editable text,” in 2010 Fifth International Conference on Information and Automation for Sustainability.    IEEE, 2010, pp. 138–143.
  • Madushanka et al. [2017] P. T. C. Madushanka, R. Bandara, and L. Ranathunga, “Sinhala handwritten character recognition by using enhanced thinning and curvature histogram based method,” in 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP).    IEEE, 2017, pp. 46–50.
  • Karunanayaka et al. [2004] M. L. M. Karunanayaka, N. D. Kodikara, and G. D. S. P. Wimalaratne, “Off line sinhala handwriting recognition with an application for postal city name recognition,” IITC 2004, 2004.
  • Weerasinghe et al. [2008] R. Weerasinghe, A. Wasala, D. Herath, and V. Welgama, “Nlp applications of sinhala: Tts & ocr,” in Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II, 2008.
  • Weerasinghe et al. [2006b] A. R. Weerasinghe, D. L. Herath, and N. P. K. Medagoda, “A nearest-neighbor based algorithm for printed sinhala character recognition,” Innovations for a Knowledge Economy, p. 11, 2006.
  • Ediriweera [2012] D. N. Ediriweera, “Improving the accuracy of the output of sinhala ocr by using a dictionary,” Ph.D. dissertation, University of Moratuwa, Sri Lanka, 2012.
  • Dias et al. [2013a] G. Dias, T. N. P. Patikirikorala, C. I. Arambewela, R. P. M. Darshana, and N. D. Alahendra, “Sinhala optical character recognition for desktops,” 2013.
  • Dias et al. [2013b] G. Dias, T. N. P. Patikirikorala, C. I. Arambewela, R. P. M. Darshani, and N. D. Alahendra, “Online sinhala handwritten character recognition for desktops,” 2013.
  • Ranmuthugala et al. [2006] M. H. P. Ranmuthugala, G. D. N. C. Pathiragoda, S. H. C. Jayasundara, G. Dias, and A. S. Karunananda, “Online sinhala handwritten character recognition on handheld devices,” Innovations for a Knowledge Economy, p. 1, 2006.
  • Rimas et al. [2013] M. Rimas, R. P. Thilakumara, and P. Koswatta, “Optical character recognition for sinhala language,” in 2013 IEEE Global Humanitarian Technology Conference: South Asia Satellite (GHTC-SAS).    IEEE, 2013, pp. 149–153.
  • Gunarathna et al. [2014] G. I. Gunarathna, M. A. P. Chamikara, and R. G. Ragel, “A fuzzy based model to identify printed sinhala characters,” in 7th International Conference on Information and Automation for Sustainability.    IEEE, 2014, pp. 1–6.
  • Premachandra et al. [2016] H. W. H. Premachandra, C. Premachandra, T. Kimura, and H. Kawanaka, “Artificial neural network based sinhala character recognition,” in International Conference on Computer Vision and Graphics.    Springer, 2016, pp. 594–603.
  • Jayamaha and Naleer [2016] J. M. H. M. Jayamaha and H. M. M. Naleer, “Feature extraction technique based character recognition using artificial neural network for sinhala characters,” 2016.
  • Kumara and Ragel [2016] T. N. Kumara and R. Ragel, “A systematic feature selection process for a sinhala character recognition system,” in 2016 IEEE International Conference on Information and Automation for Sustainability (ICIAfS).    IEEE, 2016, pp. 1–6.
  • Jayawickrama et al. [2018] B. R. Jayawickrama, L. Ranathunga, K. L. Mahaliyanaarachchi, L. G. B. Subhagya, and W. H. A. Nimasha, “Letter segmentation and modifier detection in printed sinhala signage,” in 2018 18th International Conference on Advances in ICT for Emerging Regions (ICTer).    IEEE, 2018, pp. 203–208.
  • Gunawardhana and Ranathunga [2018] S. Gunawardhana and L. Ranathunga, “Segmentation and identification of presence of sinhala characters in facebook images,” in 2018 IEEE 13th International Conference on Industrial and Information Systems (ICIIS).    IEEE, 2018, pp. 77–82.
  • Fernando et al. [2003] H. C. Fernando, N. D. Kodikara, and S. Hewavitharana, “A database for handwriting recognition research in sinhala language.” in ICDAR, 2003, pp. 1262–1264.
  • Karunanayaka et al. [2005] M. L. M. Karunanayaka, C. A. Marasinghe, and N. D. Kodikara, “Thresholding, noise reduction and skew correction of sinhala handwritten words.” in MVA, 2005, pp. 355–358.
  • Jayasekara and Udawatta [2005] B. Jayasekara and L. Udawatta, “Non-cursive sinhala handwritten script recognition: A genetic algorithm based alphabet training approach,” in Proceedings of the International Conference on Information and Automation, 2005.
  • Nilaweera et al. [2007] N. P. T. I. Nilaweera, H. L. Premaratne, and D. U. J. Sonnadara, “Comparison of projection and wavelet based techniques in recognition of sinhala handwritten scripts,” in Proceedings of the 25th National IT Conference, 2007.
  • Silva and Kariyawasam [2014] C. Silva and C. Kariyawasam, “Segmenting sinhala handwritten characters,” International Journal of Conceptions on Computing and Information Technology, vol. 2, no. 4, pp. 22–26, 2014.
  • Silva et al. [2014] C. M. Silva, N. D. Jayasundere, and C. Kariyawasam, “State of handwriting recognition of modern sinhala script,” 2014.
  • Silva et al. [2015] ——, “Contour tracing for isolated sinhala handwritten character recognition,” in 2015 Fifteenth International Conference on Advances in ICT for Emerging Regions (ICTer).    IEEE, 2015, pp. 25–31.
  • Dharmapala et al. [2017] K. A. K. N. D. Dharmapala, W. P. M. V. Wijesooriya, C. P. Chandrasekara, U. K. A. U. Rathnapriya, and L. Ranathunga, “Sinhala handwriting recognition mechanism using zone based feature extraction,” 2017.
  • Walawage and Ranathunga [2018] K. S. A. Walawage and L. Ranathunga, “Segmentation of overlapping and touching sinhala handwritten characters,” in 2018 3rd International Conference on Information Technology Research (ICITR).    IEEE, 2018, pp. 1–6.
  • Rathnasena et al. [2018] K. A. M. P. Rathnasena, K. M. S. J. Kumarasinghe, D. T. P. Paranavitharana, D. V. A. U. Dayarathne, and L. Ranathunga, “Summarization based approach for old sinhala text archival search and preservation,” in 2018 18th International Conference on Advances in ICT for Emerging Regions (ICTer).    IEEE, 2018, pp. 182–188.
  • Peiris [2012] T. M. T. H. Peiris, “Recognition of inscriptions in ancient sri lanka,” 2012.
  • Karunarathne et al. [2017] K. G. N. D. Karunarathne, K. V. Liyanage, D. A. S. Ruwanmini, K. Dias, and S. Nandasara, “Recognizing ancient sinhala inscription characters using neural network technologies,” International Journal of Scientific Engineering and Applied Sciences, vol. 3, no. 1, 2017.
  • Chanda et al. [2008] S. Chanda, S. Pal, and U. Pal, “Word-wise sinhala tamil and english script identification using gaussian kernel svm,” in 2008 19th International Conference on Pattern Recognition.    IEEE, 2008, pp. 1–4.
  • Liyanapathirana and Weerasinghe [2011] J. Liyanapathirana and R. Weerasinghe, “English to sinhala machine translation: Towards better information access for sri lankans,” in Conference on Human Language Technology for Development, 2011, pp. 182–186.
  • Liyanapathirana [2013] J. U. Liyanapathirana, “A statistical approach to english and sinhala translation,” 2013.
  • Wijerathna et al. [2012] L. Wijerathna, W. L. S. L. Somaweera, S. L. Kaduruwana, Y. V. Wijesinghe, D. I. De Silva, K. Pulasinghe, and S. Thelijjagoda, “A translator from sinhala to english and english to sinhala (sees),” in International Conference on Advances in ICT for Emerging Regions (ICTer2012).    IEEE, 2012, pp. 14–18.
  • De Silva et al. [2008] D. De Silva, A. Alahakoon, I. Udayangani, V. Kumara, D. Kolonnage, H. Perera, and S. Thelijjagoda, “Sinhala to english language translator,” in 2008 4th International Conference on Information and Automation for Sustainability.    IEEE, 2008, pp. 419–424.
  • Silva and Weerasinghe [2008] A. M. Silva and R. Weerasinghe, “Example based machine translation for english-sinhala translations,” in Proceedings of the 09th International IT Conference, 2008, pp. 27–28.
  • Vidanaralage et al. [2018] A. J. Vidanaralage, A. U. Illangakoon, S. Y. Sumanaweera, C. Pavithra, and S. Thelijjagoda, “Sinhala language decoder,” in 2018 National Information Technology Conference (NITC).    IEEE, 2018, pp. 1–5.
  • Tennage et al. [2017a] P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath, S. Ranathunga, S. Jayasena, and G. Dias, “Neural machine translation for sinhala and tamil languages,” in Asian Language Processing (IALP), 2017 International Conference on.    IEEE, 2017, pp. 189–192.
  • Tennage et al. [2017b] P. N. Tennage, M. W. D. P. Sandaruwan, J. K. M. M. Thilakarathne, A. N. Herath, S. Ranathunga, S. Jayasena, and G. Dias, “Neural machine translation for sinhala-tamil,” 2017.
  • Tennage et al. [2018a] P. Tennage, A. Herath, M. Thilakarathne, P. Sandaruwan, and S. Ranathunga, “Transliteration and byte pair encoding to improve tamil to sinhala neural machine translation,” in 2018 Moratuwa Engineering Research Conference (MERCon).    IEEE, 2018, pp. 390–395.
  • Tennage et al. [2018b] P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath, and S. Ranathunga, “Handling rare word problem using synthetic training data for sinhala and tamil neural machine translation,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
  • Ranathunga et al. [2018] S. Ranathunga, F. Farhath, U. Thayasivam, S. Jayasena, and G. Dias, “Si-ta: Machine translation of sinhala and tamil official documents,” in 2018 National Information Technology Conference (NITC).    IEEE, 2018, pp. 1–6.
  • Farhath et al. [2018b] F. Farhath, S. Ranathunga, S. Jayasena, and G. Dias, “Integration of bilingual lists for domain-specific statistical machine translation for sinhala-tamil,” in 2018 Moratuwa Engineering Research Conference (MERCon).    IEEE, 2018, pp. 538–543.
  • Weerasinghe [2003] R. Weerasinghe, “A statistical machine translation approach to sinhala-tamil language translation,” Towards an ICT enabled Society, p. 136, 2003.
  • Sripirakas et al. [2010] S. Sripirakas, A. R. Weerasinghe, and D. L. Herath, “Statistical machine translation of systems for sinhala-tamil,” in Advances in ICT for Emerging Regions (ICTer), 2010 International Conference on.    IEEE, 2010, pp. 62–68.
  • Jeyakaran [2013] M. Jeyakaran, “A novel kernel regression based machine translation system for sinhala-tamil translation,” 2013.
  • Pushpananda et al. [2013] R. Pushpananda, R. Weerasinghe, and M. Niranjan, “Towards sinhala tamil machine translation,” in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on.    IEEE, 2013, pp. 288–288.
  • Pushpananda et al. [2014] ——, “Sinhala-tamil machine translation: Towards better translation quality,” in Proceedings of the Australasian Language Technology Association Workshop 2014, 2014, pp. 129–133.
  • Rajpirathap et al. [2015] S. Rajpirathap, S. Sheeyam, K. Umasuthan, and A. Chelvarajah, “Real-time direct translation system for sinhala and tamil languages,” in 2015 Federated Conference on Computer Science and Information Systems (FedCSIS).    IEEE, 2015, pp. 1437–1443.
  • Thelijjagoda [2004] S. Thelijjagoda, “Japanese-sinhalese mt system (jaw/sinhalese),” in Proceedings of Asian Symposium on Natural Language Processing to Overcome Language Barriers, IJCNLP-04 Satellite Symposium, 2004.
  • Shalini and Hettige [2017] R. M. M. Shalini and B. Hettige, “Dictionary based machine translation system for pali to sinhala,” in SLAAI-International Conference on Artificial Intelligence, 2017, p. 23.
  • Wasala et al. [2010] A. Wasala, R. Weerasinghe, R. Pushpananda, C. Liyanage, and E. Jayalatharachchi, “A data-driven approach to checking and correcting spelling errors in sinhala,” Int. J. Adv. ICT Emerg. Reg, vol. 3, no. 01, 2010.
  • Wasala et al. [2011] R. A. Wasala, R. Weerasinghe, R. Pushpananda, C. Liyanage, and E. Jayalatharachchi, “An open-source data driven spell checker for sinhala,” ICTer, vol. 3, no. 1, 2011.
  • Jayalatharachchi et al. [2012] E. Jayalatharachchi, A. Wasala, and R. Weerasinghe, “Data-driven spell checking: the synergy of two algorithms for spelling error detection and correction,” in International Conference on Advances in ICT for Emerging Regions (ICTer2012).    IEEE, 2012, pp. 7–13.
  • Subhagya et al. [2018] L. G. B. Subhagya, L. Ranathunga, W. H. A. Nimasha, B. R. Jayawickrama, and K. L. Mahaliyanaarachchi, “Data driven approach to sinhala spellchecker and correction,” in 2018 18th International Conference on Advances in ICT for Emerging Regions (ICTer).    IEEE, 2018, pp. 1–6.
  • Punchimudiyanse and Meegama [2017a] M. Punchimudiyanse and R. G. N. Meegama, “Computer interpreter for translating written sinhala to sinhala sign,” OUSL Journal, vol. 12, no. 1, pp. 70–90, 2017.
  • Punchimudiyanse and Meegama [2017b] ——, “Animation of fingerspelled words and number signs of the sinhala sign language,” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 16, no. 4, p. 24, 2017.
  • Fernando [2011] S. C. Fernando, “Inexact matching of proper names in sinhala,” 2011.