Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

  • 2019-06-05 23:36:06
  • Nisansa de Silva
  • 6

Abstract

Sinhala is the native language of the Sinhalese people, who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning Indo-European language tree. However, due to poverty in both linguistic and economic capital, Sinhala, from the perspective of natural language processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers that a language such as Chinese has. A number of research groups from Sri Lanka have noticed this lack and the dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap with a comprehensive literature survey of the publicly available Sinhala natural language tools and research, so that researchers working in this field can better utilize the contributions of their peers. As such, we shall upload this paper to arXiv and update it periodically to reflect the advances made on the topic.

 

Nisansa de Silva is with the Department of Computer Science & Engineering, University of Moratuwa.
E-mail: [email protected]

Keywords: Sinhala, Natural Language Processing, Resource Poor Language

1 Introduction

The Sinhala language, being the native language of the Sinhalese people [1], who make up the largest ethnic group of the island country of Sri Lanka, is reported as the mother tongue of approximately 16 million people [2]. To give a brief linguistic background for the purpose of aligning the Sinhala language with the baseline of English, it should first be noted that Sinhala belongs to the same Indo-European language tree [3]. However, unlike English, which is part of the Germanic branch, Sinhala belongs to the Indo-Aryan branch. Further, unlike English, which borrowed the Latin alphabet, Sinhala has its own writing system, which is a descendant of the Indian Brahmi script [4, 5, 6, 7, 8, 9]. By extension, this makes the Sinhala script a member of the Aramaic family of scripts [10, 11]. It should be noted that the modern Sinhala language has loanwords from languages such as Tamil, English, Portuguese, and Dutch due to various historical reasons. Despite the rich historical array of literature spanning several millennia, starting between the 3rd and 2nd centuries BCE [12], modern natural language processing tools for the Sinhala language are scarce [13].

Natural Language Processing (NLP) is a broad area covering all computational processing and analysis of human languages. To achieve this end, NLP systems operate at different levels [14, 15]. A graphical representation of NLP layers and application domains is shown in Figure 1. On one hand, according to Liddy [15], these systems can be categorized into the following layers: phonological, morphological, lexical, syntactic, semantic, discourse, and pragmatic. The phonological layer deals with the interpretation of language sounds. As such, it consists mainly of speech-to-text and text-to-speech systems. In cases where one is working with the written text of a language rather than speech, it is possible to replace this layer with tools which handle Optical Character Recognition (OCR) and language rendering standards (such as Unicode [16]). The morphological layer analyses words at their smallest units of meaning. As such, analysis of word lemmas and of prefix- and suffix-based inflection is handled in this layer. The lexical layer handles individual words. Therefore, tasks such as Part of Speech (PoS) tagging happen here. The next layer, syntactic, operates at the phrase and sentence level, where grammatical structures are utilized to obtain meaning. The semantic layer attempts to derive meaning from the word level up to the sentence level, starting with Named Entity Recognition (NER) at the word level and working its way up by identifying the contexts the entities are set in, until arriving at the overall meaning. The discourse layer handles meaning in textual units larger than a sentence. Here, the function of a particular sentence may be contextualized within the document it is set in. Finally, the pragmatic layer handles context that is read into content without being explicitly mentioned [14, 15]. Some forms of anaphora (co-reference) resolution fall into this layer.

Fig. 1: NLP layers and tasks [14]

On the other hand, Wimalasuriya and Dou [17] categorize NLP tools and research by utility. They introduce three categories of increasing complexity: Information Retrieval (IR), Information Extraction (IE), and Natural Language Understanding (NLU). Information Retrieval covers applications which search for and retrieve information relevant to a given query. For pure IR, tools and methods up to and including the syntactic layer in the above analysis are used. Information Extraction, on the other hand, extracts structured information. The difference between IR and IE is that IR does not change the structure of the documents in question. Whether they are structured, semi-structured, or unstructured, all IR does is fetch them as they are. In comparison, IE takes semi-structured and unstructured text and puts it into a machine-readable structure. For this, IE utilizes all the layers used by IR as well as the semantic layer. Natural Language Understanding is purely the idea of cognition. Most NLU tasks fall under the AI-hard category and remain unsolved [14]. However, some NLU tasks, such as machine translation, are being attempted with varying accuracy. The pragmatic layer of the above analysis belongs to the NLU tasks, while the discourse layer straddles information extraction and natural language understanding [14].
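The distinction between IR and IE can be made concrete with a small sketch. The example below is illustrative only: the English sample documents, the `retrieve` and `extract` function names, and the date-and-place pattern are assumptions made for this sketch and do not come from any of the surveyed systems.

```python
import re

# Toy documents (assumed for illustration only).
DOCS = [
    "The committee met on 2019-06-05 in Colombo.",
    "Rainfall was heavy in Kandy.",
    "The next meeting is on 2019-07-10 in Galle.",
]


def retrieve(query, docs):
    """IR: return the documents that contain every query term, unchanged."""
    terms = query.lower().split()
    return [d for d in docs if all(t in d.lower() for t in terms)]


# Hypothetical pattern: an ISO-style date followed by a one-word place name.
DATE_PLACE = re.compile(r"on (\d{4}-\d{2}-\d{2}) in (\w+)")


def extract(docs):
    """IE: turn unstructured text into machine-readable records."""
    records = []
    for d in docs:
        match = DATE_PLACE.search(d)
        if match:
            records.append({"date": match.group(1), "place": match.group(2)})
    return records


if __name__ == "__main__":
    print(retrieve("meeting", DOCS))  # documents come back as they are
    print(extract(DOCS))              # structured records are produced
```

Here `retrieve` only selects documents, while `extract` imposes a structure (date, place) that the original text did not have, which is the difference described above.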

The objective of this paper is to serve as a comprehensive survey on the state of natural language processing resources for the Sinhala language. The initial structure and content of this survey are heavily influenced by the preliminary surveys carried out by de Silva [13] and Wijeratne et al. [14]. However, our hope is to host this survey at arXiv as a perpetually evolving work which is continuously updated as new research and tools for the Sinhala language are created and made publicly available. Hence, it is our hope that this work will help future researchers who are engaged in Sinhala NLP research to conduct their literature surveys efficiently and comprehensively. For the success of this survey, we shall also consider the Sri Lankan NLP tools repository, lknlp (https://github.com/lknlp/lknlp.github.io).

The remainder of this survey is organized as follows. Section 2 discusses the various tools and research available for Sinhala NLP. In that section we discuss both pure Sinhala NLP tools and research and hybrid Sinhala-English work. Section 3 discusses research and tools which contribute to Sinhala NLP either along with, or with the help of, Tamil, the other official language of Sri Lanka. Finally, Section 4 concludes the survey.

2 Sinhala resources

In this section we generally follow the structure shown in Figure 1 for sectioning. In addition, we also discuss topics such as available corpora, other data sets, dictionaries, and WordNets. We discuss both pure Sinhala NLP tools and research and hybrid Sinhala-English work, which utilizes the much richer availability of tools and research for the near-ubiquitous English language.

2.1 Corpora

For any language, the key to NLP applications and implementations is the existence of adequate corpora. On this matter, a relatively substantial Sinhala text corpus (https://osf.io/a5quv/) was created by Upeksha et al. [18, 19] by web crawling. Later, a smaller Sinhala news corpus (https://osf.io/tdb84/) was created by de Silva [13]. Both of the above corpora are publicly available. However, neither comes close to the massive capacity and range of the existing English corpora. A word corpus of approximately 35,000 entries was developed by Weerasinghe et al. [20], but it does not seem to be online anymore. A number of Sinhala-English parallel corpora were introduced by Guzmán et al. [21]. These include 600k+ Sinhala-English subtitle pairs (http://bit.ly/2KsFQxm) initially collected by Lison and Tiedemann [22], and 45k+ Sinhala-English sentence pairs from GNOME (http://bit.ly/2Z8q0fo), KDE (http://bit.ly/2WLY6bI), and Ubuntu (http://bit.ly/2wLVZGt). Guzmán et al. [21] further provided two monolingual corpora for Sinhala: 155k+ sentences of filtered Sinhala Wikipedia (http://bit.ly/2EQZ7oM) and 5178k+ sentences of Sinhala Common Crawl (http://bit.ly/2ZaQFZo).

2.2 Data Sets

Specific data sets for Sinhala, as expected, are scarce. However, a Sinhala PoS tagged data set [23, 24, 25] is available for download from GitHub (http://bit.ly/2Krhrbv). Further, a Sinhala NER data set created by Manamini et al. [26] is also available for download from GitHub (http://bit.ly/2XrwCoK).

Facebook has released FastText [27, 28, 29] models for the Sinhala language trained using the Wikipedia corpus. They are available both as text models (http://bit.ly/2JXAyL8) and as binary files (http://bit.ly/2JY5J9c). Using the above models by Facebook, a group at the University of Moratuwa has created an extended FastText model trained on Wikipedia, news, and official government documents. The binary file (http://bit.ly/2WowH0h) of the trained model is available for download.
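Text-format fastText vectors use a simple plain-text layout: a header line giving the vocabulary size and vector dimension, followed by one line per word with its vector components. A minimal sketch of reading such a file and comparing two words by cosine similarity is given below, assuming that layout. The in-memory sample, the placeholder tokens `word_a`, `word_b`, `word_c`, and their values are made up for illustration and are not taken from the released Sinhala models.

```python
import io
import math


def load_vec(handle):
    """Parse text-format vectors: header 'count dim', then 'word v1 ... v_dim'."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < dim + 1:
            continue  # skip blank or malformed lines
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up two-dimensional sample standing in for a real .vec file.
SAMPLE = "3 2\nword_a 1.0 0.0\nword_b 0.8 0.6\nword_c 0.0 1.0\n"

if __name__ == "__main__":
    dim, vecs = load_vec(io.StringIO(SAMPLE))
    print(dim, len(vecs))
    print(cosine(vecs["word_a"], vecs["word_b"]))
```

To use a downloaded model instead of the sample, the `io.StringIO(SAMPLE)` argument would be replaced by a file opened with `encoding="utf-8"`.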

2.3 Dictionaries

A necessary component for bridging Sinhala and English resources is an English-Sinhala dictionary. The earliest and most extensive Sinhala-English dictionary available for consumption was by Malalasekera [30]. However, this dictionary is locked behind copyright laws and is not available for public research and development. The dictionary by Kulatunga [31] is publicly available through an online web interface but does not provide API access or any means to directly access the data set. The largest publicly available English-Sinhala dictionary data set is from a discontinued Firefox plug-in, EnSiTip [32], which bears a more than passing resemblance to the above. Hettige and Karunananda [33] claim to have created a lexicon to help in their attempt to create a system capable of English to Sinhala machine translation. Yet again, the available Sinhala resources fall well short of what is available for language pairings such as English and French.

2.4 WordNets

WordNets [34] are extremely powerful and versatile components of many NLP applications. They encompass a number of linguistic relations that exist between the words in the lexicon of a language, including but not limited to hyponymy and hypernymy, synonymy, and meronymy. Their uses range from simple gazetteer listing applications [17] to information extraction based on semantic similarity [35, 36] or semantic oppositeness [37]. An attempt has been made to build a Sinhala WordNet [38]. For a time it was hosted online [39], but it is now defunct and all the data and applications are lost. Moreover, even at its peak, due to the lack of volunteers for the crowdsourced methodology of populating the WordNet, it was at best an incomplete product. Another effort to build a Sinhala WordNet was initiated by Welgama et al. [40] independently of the above, but it too stopped progressing, even before reaching the level of completion of the former.
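To illustrate how semantic similarity is derived from a WordNet hierarchy, the sketch below computes the Wu and Palmer measure [35], defined as twice the depth of the lowest common ancestor divided by the sum of the depths of the two concepts. The toy English taxonomy, the `PARENT` table, and the function names are assumptions made for illustration; they are not part of either Sinhala WordNet effort.

```python
# Toy hypernym hierarchy (assumed): each concept maps to its parent.
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}


def path_to_root(concept):
    """Return the chain of concepts from `concept` up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_b = set(path_to_root(b))
    # The first ancestor of `a` that is also an ancestor of `b` is the deepest one.
    lcs = next(c for c in path_to_root(a) if c in ancestors_of_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


if __name__ == "__main__":
    print(wu_palmer("dog", "cat"))   # siblings under "animal"
    print(wu_palmer("dog", "tree"))  # related only through the root
```

Concepts that share a deep common ancestor score close to 1, while concepts that meet only at the root score low, which is why a sparsely populated WordNet limits the usefulness of such measures.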

2.5 Morphological Analyzers

As shown in Fig. 1, morphological analysis is a ground-level component of natural language processing. Given that Sinhala is a highly inflected language [41, 42, 13], a proper morphological analysis process is vital. However, the only work on this avenue of research which could be found was a study restricted to the morphological analysis of Sinhala verbs [43]. There was no indication of whether this work was continued to cover other types of words. Further, other than this single publication, no data or tools were made publicly accessible. Completely independent of the above, Welgama et al. [44] attempted to evaluate machine learning approaches for Sinhala morphological analysis. Yet another independent attempt to create a morphological parser for Sinhala verbs was carried out by Fernando and Weerasinghe [45]. As a step in their efforts to create a system capable of English to Sinhala machine translation, Hettige and Karunananda [46] also claim to have created a morphological analyzer.

2.6 Part of Speech Taggers

The next step after morphological analysis is Part of Speech (PoS) tagging. PoS tags differ in number and functionality from language to language. Therefore, the first step in creating an effective PoS tagger is to identify the PoS tag set for the language. This work has been accomplished by Fernando et al. [25] and Dilshani et al. [24]. Expanding on that, Fernando et al. [25] introduced an SVM-based PoS tagger for Sinhala, and finally Fernando and Ranathunga [23] gave an evaluation of different classifiers for the task of Sinhala PoS tagging. While it is obvious that there has been some follow-up work after the initial foundation, it seems all of that has been internal to one research group at one institution, as neither the data nor the tools of any of these findings have been made available for the use of external researchers. Several attempts have been made to create a stochastic part of speech tagger for Sinhala, with those by Herath and Weerasinghe [47] and Jayasuriya and Weerasinghe [48] being the most notable. A hybrid PoS tagger for the Sinhala language was proposed by Gunasekara et al. [49]. Within a single group, yet another set of studies was carried out to create a Sinhala PoS tagger, starting with the foundation of Jayaweera and Dias [50], which was then extended to a Hidden Markov Model (HMM) based approach [51] and an analysis of unknown words [52]. Further, this group presented a comparison of the few Sinhala PoS taggers that were available to them [53].
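For readers unfamiliar with HMM-based tagging, the following is a generic sketch of Viterbi decoding, which finds the most probable tag sequence under given start, transition, and emission probabilities. It is not the implementation of any tagger cited above. The two-tag set, the probabilities, the English gloss tokens (ordered subject, object, verb), and the `unk` floor for unseen words are all assumptions chosen for illustration.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for `tokens` under an HMM.

    `unk` is an assumed floor probability for words a tag never emitted.
    """
    if not tokens:
        return []
    # best[i][t]: log-probability of the best path that ends in tag t at position i.
    best = [{t: math.log(start_p[t]) + math.log(emit_p[t].get(tokens[0], unk))
             for t in tags}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][t])) for p in tags),
                key=lambda item: item[1],
            )
            best[i][t] = score + math.log(emit_p[t].get(tokens[i], unk))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model (assumed values, not estimated from any corpus).
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.8, "VERB": 0.2}
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.6, "VERB": 0.4}}
EMIT = {"NOUN": {"child": 0.5, "book": 0.5}, "VERB": {"reads": 1.0}}

if __name__ == "__main__":
    print(viterbi(["child", "book", "reads"], TAGS, START, TRANS, EMIT))
```

In a real tagger the three probability tables would be estimated from a tagged corpus, which is why the availability of PoS tagged data matters for this line of work.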

2.7 Parsers

The PoS tagged data then needs to be handed over to a parser. This is an area that is not completely solved even in English due to various ambiguities in natural language. However, in the case of English, there are systems that provide adequate, if not yet perfect, results [54]. A prosodic phrasing model for the Sinhala language has been implemented by Bandara et al. [55]. While they do report reasonable results, yet again, they do not provide any means for the public to access the data or the tools that they have developed. Work by Liyanage et al. [41] is also concentrated on this layer, given that they have worked on formalizing a computational grammar for Sinhala. The work of Kanduboda and Prabath [42] on Sinhala differential object markers is also an example of research done for the Sinhala language at the parser level. Another parser for the Sinhala language has been proposed by Hettige and Karunananda [56], with a model for grammar [57].

2.8 Named Entity Recognition Systems

As shown in Fig. 1, once the text is properly parsed, it has to be processed using a Named Entity Recognition (NER) system. An NER system for Sinhala named Ananya has been developed by Manamini et al. [26]. But similar to the above developments, the developed data and tools seem to be held internally by the research group rather than being made publicly available. Another independent attempt at Sinhala NER has been made by Dahanayaka and Weerasinghe [58], but that too is not accessible to the public.

2.9 Semantic Tools

Applications of the semantic layer are more advanced than the ones below it in Figure 1. But even with the obvious lack of resources and tools, a number of attempts have been made at semantic-level applications for the Sinhala language. A Sinhala semantic similarity measure has been developed for short sentences by Kadupitiya et al. [59]. This work was then extended by Kadupitiya et al. [60] for the application use case of short answer grading. Data and tools for these projects are not publicly available. Text classification is a popular application on the semantic layer of the NLP stack. Nanayakkara and Ranathunga [61] have implemented a system which uses corpus-based similarity measures for this purpose. This, too, is unavailable to external researchers. A smaller implementation of Sinhala news classification has been attempted by de Silva [13]. As mentioned above, that news corpus is publicly available (https://osf.io/tdb84/). But it is extremely small and thus may not be of much use for extensive research. A word2vec-based tool (http://bit.ly/2QKI9Np) for sentiment analysis of Sinhala news comments is also available.

2.10 Phonological Tools

In the case of the phonological layer, a Sinhala text-to-speech system was developed by Weerasinghe et al. [62]. However, it is not publicly accessible, and no work on a corresponding Sinhala speech-to-text system could be found. A separate group has worked on Sinhala text-to-speech systems independently of the above [63]. Conversely, Nadungodage et al. [64] have done a series of work on Sinhala speech recognition, with special attention given to Sinhala being a resource-poor language. This project divides its focus among continuity [65], active learning [66], and speaker adaptation [67].

2.11 Optical Character Recognition Tools

While it is not necessarily a component of the NLP stack shown in Fig. 1, which follows the definition by Liddy [15], it is possible to swap out the bottom-most phonological layer of the stack in favour of an Optical Character Recognition (OCR) layer. This is more relevant in the case of NLP for government, in the sense that government NLP systems are more likely to need to work with handwritten forms and letters than with spoken voice clips. With this in mind, let us look at the available Sinhala OCR systems. The earliest attempt at a Sinhala OCR system was by Dias et al. [68]. It was then extended to be online and made available for use via desktops [69] and hand-held devices [70], with the ability to recognize handwriting. A separate group also attempted Sinhala OCR [71], mainly involving the nearest-neighbor method [72]. Yet another attempt at this problem was made by Rajapakse et al. [73], before the above two groups.
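As a rough illustration of what a nearest-neighbor approach to character recognition involves, the sketch below labels a small binary bitmap with the label of its closest stored template. It is a generic toy example and not a reconstruction of the cited system: the 3x3 bitmaps, the placeholder labels `glyph_bar` and `glyph_ring`, and the choice of Hamming distance are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of pixel positions at which two equal-length bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(sample, templates):
    """1-nearest-neighbor: return the label of the closest stored template."""
    return min(templates, key=lambda label: hamming(sample, templates[label]))


# Hypothetical 3x3 glyph templates, flattened row by row (1 = ink, 0 = blank).
TEMPLATES = {
    "glyph_bar": (0, 1, 0,
                  0, 1, 0,
                  0, 1, 0),
    "glyph_ring": (1, 1, 1,
                   1, 0, 1,
                   1, 1, 1),
}

if __name__ == "__main__":
    # A vertical bar with one extra noisy pixel in the top-left corner.
    noisy = (1, 1, 0,
             0, 1, 0,
             0, 1, 0)
    print(classify(noisy, TEMPLATES))
```

Practical systems would use many labelled samples per character and richer features than raw pixels, but the decision rule stays the same: pick the label of the nearest known example.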

2.12 Sinhala-English Translators

A series of work towards English to Sinhala translation has been done by one group, as mentioned in some of the above paragraphs. This work includes building a morphological analyzer [46], lexicon databases [33], a transliteration system [74], an evaluation model [75], a computational model of grammar [57], and a multi-agent solution [76]. Another group independently attempted English to Sinhala machine translation [77] with a statistical approach [78].

3 Sinhala-Tamil bridging resources

Being resource-poor languages, Sinhala and Tamil NLP implementations can potentially help each other through collaboration. The fact that the two languages are the official languages of Sri Lanka ought to generate a significant number of parallel data sets in the form of official government documents. It is vital to have a direct link between Sinhala and Tamil rather than using English to weakly link them together. This is especially true in the case of translation, where a double-layer approach of translating Sinhala to English and then English to Tamil (or the other way around) might cause serious information loss.

As in the case of any pair of languages, the most vital NLP resource is a set of parallel corpora. To this end, Mohamed et al. [79] claim to have built a word-aligned Sinhala-Tamil parallel corpus. However, at the time of writing this paper, it was not publicly available. A very small Sinhala-Tamil aligned parallel corpus created by Farhath et al. [80] using order papers of the government of Sri Lanka is available for download (http://bit.ly/2HTMEme).

Next, there exists the government-sponsored trilingual dictionary [81]. However, other than a crude web interface on the ministry website, there is no efficient API or any other way for a researcher to access the data of this dictionary. Weerasinghe and Dias [82] have created a multilingual place name database for Sri Lanka which may function both as a dictionary and as a resource for certain NER tasks.

Most of the cross Sinhala and Tamil work has been done in the domain of machine translation. A neural machine translation effort for the Sinhala and Tamil languages was initiated by Tennage et al. [83]. They then enhanced it with transliteration and byte pair encoding [84] and used synthetic training data to handle the rare word problem [85]. This project produced Si-Ta [86], a machine translation system for Sinhala and Tamil official documents. On the statistical machine translation front, Farhath et al. [87] worked on integrating bilingual lists. The attempts by Weerasinghe [88] and Sripirakas et al. [89] were also focused on statistical machine translation, while Jeyakaran [90] attempted a kernel regression method. Yet another attempt was made by Pushpananda et al. [91], which they later extended with some quality improvements [92].
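Byte pair encoding, mentioned above as one of the enhancements, builds a subword vocabulary by repeatedly merging the most frequent adjacent pair of symbols. The sketch below shows the generic merge-learning loop on a toy word-frequency table; it is not the cited system's code. The Latin-script words, their counts, the function name `learn_bpe`, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge operations from a word -> frequency table."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by comparing the pairs themselves
        # (an assumed rule that keeps the result deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy word frequencies (assumed for illustration).
TOY_FREQS = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

if __name__ == "__main__":
    merges, vocab = learn_bpe(TOY_FREQS, 2)
    print(merges)
    print(vocab)
```

Because frequent character sequences become single symbols while rare words remain decomposable into smaller units, this kind of segmentation is one way of coping with the rare word problem in morphologically rich languages.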

4 Conclusion

At this point, a reader might think that there seems to be a significant number of implementations of NLP for Sinhala. How, then, can one justify listing Sinhala as a resource-poor language? The important point missing from that assumption is that, for almost all of the above-listed implementations and findings, the only thing publicly available to a researcher is a set of research papers. The corpora, tools, algorithms, and anything else discovered through this research are either locked away as the property of individual research groups or, worse, lost to time through crashed servers, lost hard drives, and expired web hosts. This, and probably an unsavoury amount of academic and research competition, has caused these attempts not to cite or build upon each other's work. In many cases where similar work is done, it is a re-hashing of the same ideas adopted from resource-rich languages, because of either the unavailability of, or the reluctance to refer to and build on, work done by another group. This has resulted in multiple groups building multiple foundations behind closed walls, with no one ending up with a completed hut, let alone a house. In conclusion, even though there are islands of implementations for Sinhala NLP, they are of very small scale and/or are not readily accessible for further use and research by other researchers. Thus, so far, Sinhala remains a resource-poor language.

It also has to be noted that, similar to the above sections on separate resources for Sinhala, the available Sinhala-Tamil bridging resources fall short of the mark of raising either or both languages to the level of resource-rich languages. Again, the fault lies in the scarcity of tools and applications as well as the unavailability of those few that actually exist.

References

  • Bauer [2007] L. Bauer, Linguistics Student’s Handbook.    Edinburgh University Press, 2007.
  • [2] Department of Census and Statistics Sri Lanka. Percentage of population aged 10 years and over in major ethnic groups by district and ability to speak sinhala, tamil and english languages. [Online]. Available: https://goo.gl/nnVZSd
  • [3] H. Young. A language family tree - in pictures — education — the guardian. [Online]. Available: https://www.theguardian.com/education/gallery/2015/jan/23/a-language-family-tree-in-pictures
  • Bandara et al. [2012] D. Bandara, N. Warnajith, A. Minato, and S. Ozawa, “Creation of precise alphabet fonts of early brahmi script from photographic data of ancient sri lankan inscriptions,” Canadian Journal on Artificial Intelligence, Machine Learning and Pattern Recognition, vol. 3, no. 3, pp. 33–39, 2012.
  • Daniels and Bright [1996] P. T. Daniels and W. Bright, The world’s writing systems.    Oxford University Press on Demand, 1996.
  • Sirisoma [1990] M. Sirisoma, “Brahmi inscriptions of sri lanka from 3rd century bc to 65 ad,” pp. 3–54, 1990.
  • Dias [1996] M. Dias, “Lakdiwa sellipiwalin heliwana sinhala bhashawe prathyartha namayange vikashanaya,” Department of Archaeology, Colombo Sri Lanka, p. 1, 1996.
  • Hettiarachchi [1990] A. Hettiarachchi, “Investigation of 2nd, 3rd and 4th century inscriptions,” Inscriptions: Volume Two, Archaeological Department Centenary (1890–1990), Commemorative Series. Colombo: Department of Archaeology, pp. 57–104, 1990.
  • Paranavitana and Depārtamēntuva [1970] S. Paranavitana and S. L. P. Depārtamēntuva, Inscriptions of Ceylon.    Dept. of Archaeology, 1970.
  • Salomon [1998] R. Salomon, Indian epigraphy: a guide to the study of inscriptions in Sanskrit, Prakrit, and the other Indo-Aryan languages.    Oxford University Press, 1998.
  • Falk [1993] H. Falk, Schrift im alten Indien: ein Forschungsbericht mit Anmerkungen.    Gunter Narr Verlag, 1993, vol. 56.
  • Ray [2003] H. P. Ray, The archaeology of seafaring in ancient South Asia.    Cambridge University Press, 2003.
  • de Silva [2015] N. de Silva, “Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language,” 2015.
  • Wijeratne et al. [2019] Y. Wijeratne, N. de Silva, and Y. Shanmugarajah, “Natural Language Processing for Government: Problems and Potential,” LIRNEasia, 2019.
  • Liddy [2001] E. D. Liddy, “Natural language processing,” 2001.
  • Consortium et al. [1996] U. Consortium et al., “The unicode standard: A technical introduction,” online document, http://www. unicode. org/unicode/standards/principles. html, 1996.
  • Wimalasuriya and Dou [2010] D. C. Wimalasuriya and D. Dou, “Ontology-based information extraction: An introduction and a survey of current approaches,” Journal of Information Science, vol. 36, no. 3, pp. 306–323, 2010.
  • Upeksha et al. [2015a] D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. H. N. D. De Silva, and G. Dias, “Implementing a Corpus for Sinhala Language,” in Symposium on Language Technology for South Asia 2015, 2015.
  • Upeksha et al. [2015b] D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. H. N. D. de Silva, and G. Dias, “Comparison between performance of various database systems for implementing a language corpus,” in International Conference: Beyond Databases, Architectures and Structures.    Springer, May 2015, pp. 82–91.
  • Weerasinghe et al. [2009] R. Weerasinghe, D. Herath, and V. Welgama, “Corpus-based sinhala lexicon,” in Proceedings of the 7th Workshop on Asian Language Resources.    Association for Computational Linguistics, 2009, pp. 17–23.
  • Guzmán et al. [2019] F. Guzmán, P.-J. Chen, M. Ott, J. Pino, G. Lample, P. Koehn, V. Chaudhary, and M. Ranzato, “Two new evaluation datasets for low-resource machine translation: Nepali-english and sinhala-english,” arXiv preprint arXiv:1902.01382, 2019.
  • Lison and Tiedemann [2016] P. Lison and J. Tiedemann, “Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles,” 2016.
  • Fernando and Ranathunga [2018] S. Fernando and S. Ranathunga, “Evaluation of different classifiers for sinhala pos tagging,” in 2018 Moratuwa Engineering Research Conference (MERCon).    IEEE, 2018, pp. 96–101.
  • Dilshani et al. [2017] N. Dilshani, S. Fernando, S. Ranathunga, S. Jayasena, and G. Dias, “A comprehensive part of speech (pos) tag set for sinhala language.”    The Third International Conference on Linguistics in Sri Lanka, ICLSL 2017 …, 2017.
  • Fernando et al. [2016] S. Fernando, S. Ranathunga, S. Jayasena, and G. Dias, “Comprehensive part-of-speech tag set and svm based pos tagger for sinhala,” in Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), 2016, pp. 173–182.
  • Manamini et al. [2016] S. Manamini, A. Ahamed, R. Rajapakshe, G. Reemal, S. Jayasena, G. Dias, and S. Ranathunga, “Ananya-a named-entity-recognition (ner) system for sinhala language,” in Moratuwa Engineering Research Conference (MERCon), 2016.    IEEE, 2016, pp. 30–35.
  • Bojanowski et al. [2017] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
  • Joulin et al. [2017] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 427–431.
  • Joulin et al. [2016] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov, “Fasttext. zip: Compressing text classification models,” arXiv preprint arXiv:1612.03651, 2016.
  • Malalasekera [1967] G. P. Malalasekera, “English-sinhalese dictionary.” 1967.
  • [31] M. Kulatunga. Madura english-sinhala dictionary - online language translator. [Online]. Available: https://maduraonline.com/
  • Wasala and Weerasinghe [2008] A. Wasala and R. Weerasinghe, “Ensitip: a tool to unlock the english web,” in 11th international conference on humans and computers, Nagaoka University of Technology, Japan, 2008, pp. 20–23.
  • Hettige and Karunananda [2007a] B. Hettige and A. Karunananda, “Developing lexicon databases for english to sinhala machine translation,” in Industrial and Information Systems, 2007. ICIIS 2007. International Conference on.    IEEE, 2007, pp. 215–220.
  • Miller [1995] G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
  • Wu and Palmer [1994] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,” in Proceedings of the 32nd annual meeting on Association for Computational Linguistics.    Association for Computational Linguistics, 1994, pp. 133–138.
  • Jiang and Conrath [1997] J. J. Jiang and D. W. Conrath, “Semantic similarity based on corpus statistics and lexical taxonomy,” in Proc of 10th International Conference on Research in Computational Linguistics, ROCLING’97.    Citeseer, 1997.
  • de Silva et al. [2017] N. de Silva, D. Dou, and J. Huang, “Discovering inconsistencies in pubmed abstracts through ontology-based information extraction,” in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics.    ACM, 2017, pp. 362–371.
  • Wijesiri et al. [2014] I. Wijesiri, M. Gallage, B. Gunathilaka, M. Lakjeewa, D. Wimalasuriya, G. Dias, R. Paranavithana, and N. De Silva, “Building a wordnet for sinhala,” in Proceedings of the Seventh Global WordNet Conference, 2014, pp. 100–108.
  • [39] Sinhala wordnet. [Online]. Available: http://www.wordnet.lk/
  • Welgama et al. [2011] V. Welgama, D. L. Herath, C. Liyanage, N. Udalamatta, R. Weerasinghe, and T. Jayawardana, “Towards a sinhala wordnet,” in Proceedings of the Conference on Human Language Technology for Development, 2011.
  • Liyanage et al. [2012] C. Liyanage, R. Pushpananda, D. L. Herath, and R. Weerasinghe, “A computational grammar of sinhala,” in International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 2012, pp. 188–200.
  • Kanduboda and Prabath [2013] A. Kanduboda and B. Prabath, “On the usage of sinhalese differential object markers object marker /wa/ vs. object marker /ta/,” Theory and Practice in Language Studies, vol. 3, no. 7, p. 1081, 2013.
  • Dilshani and Dias [2017] W. Dilshani and G. Dias, “A corpus-based morphological analysis of sinhala verbs.” The Third International Conference on Linguistics in Sri Lanka, ICLSL 2017 …, 2017.
  • Welgama et al. [2013] V. Welgama, R. Weerasinghe, and M. Niranjan, “Evaluating a machine learning approach to sinhala morphological analysis,” in Proceedings of the 10th International Conference on Natural Language Processing, Noida, India, 2013.
  • Fernando and Weerasinghe [2013] N. Fernando and R. Weerasinghe, “A morphological parser for sinhala verbs,” in Proceedings of the International Conference on Advances in ICT for Emerging Regions, 2013.
  • Hettige and Karunananda [2006a] B. Hettige and A. S. Karunananda, “A morphological analyzer to enable english to sinhala machine translation,” in Information and Automation, 2006. ICIA 2006. International Conference on. IEEE, 2006, pp. 21–26.
  • Herath and Weerasinghe [2004] D. L. Herath and A. Weerasinghe, “A stochastic part of speech tagger for sinhala,” in Proceedings of the 06th International Information Technology Conference, 2004, pp. 27–28.
  • Jayasuriya and Weerasinghe [2013] M. Jayasuriya and A. Weerasinghe, “Learning a stochastic part of speech tagger for sinhala,” in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on. IEEE, 2013, pp. 137–143.
  • Gunasekara et al. [2016] D. Gunasekara, W. Welgama, and A. Weerasinghe, “Hybrid part of speech tagger for sinhala language,” in Advances in ICT for Emerging Regions (ICTer), 2016 Sixteenth International Conference on. IEEE, 2016, pp. 41–48.
  • Jayaweera and Dias [2011] A. Jayaweera and N. Dias, “Part of speech (pos) tagger for sinhala language,” 2011.
  • Jayaweera and Dias [2014a] ——, “Hidden markov model based part of speech tagger for sinhala language,” arXiv preprint arXiv:1407.2989, 2014.
  • Jayaweera and Dias [2014b] ——, “Unknown words analysis in pos tagging of sinhala language,” in Advances in ICT for Emerging Regions (ICTer), 2014 International Conference on. IEEE, 2014, pp. 270–270.
  • Jayaweera and Dias [2016] M. Jayaweera and N. Dias, “Comparison of part of speech taggers for sinhala language,” 2016.
  • Manning et al. [2014] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60. [Online]. Available: http://www.aclweb.org/anthology/P/P14/P14-5010
  • Bandara et al. [2013] W. Bandara, V. Lakmal, T. Liyanagama, S. Bulathsinghala, G. Dias, and S. Jayasena, “A new prosodic phrasing model for sinhala language,” 2013.
  • Hettige and Karunananda [2006b] B. Hettige and A. S. Karunananda, “A parser for sinhala language-first step towards english to sinhala machine translation,” in Industrial and Information Systems, First International Conference on. IEEE, 2006, pp. 583–587.
  • Hettige and Karunananda [2011] B. Hettige and A. Karunananda, “Computational model of grammar for english to sinhala machine translation,” in Advances in ICT for Emerging Regions (ICTer), 2011 International Conference on. IEEE, 2011, pp. 26–31.
  • Dahanayaka and Weerasinghe [2014] J. Dahanayaka and A. Weerasinghe, “Named entity recognition for sinhala language,” in Advances in ICT for Emerging Regions (ICTer), 2014 International Conference on. IEEE, 2014, pp. 215–220.
  • Kadupitiya et al. [2016] J. Kadupitiya, S. Ranathunga, and G. Dias, “Sinhala short sentence similarity calculation using corpus-based and knowledge-based similarity measures,” in Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), 2016, pp. 44–53.
  • Kadupitiya et al. [2017] ——, “Sinhala short sentence similarity measures using corpus-based similarity for short answer grading,” in 6th Workshop on South and Southeast Asian Natural Language Processing, 2017, pp. 44–53.
  • Nanayakkara and Ranathunga [2018] P. Nanayakkara and S. Ranathunga, “Clustering sinhala news articles using corpus-based similarity measures,” in 2018 Moratuwa Engineering Research Conference (MERCon). IEEE, 2018, pp. 437–442.
  • Weerasinghe et al. [2007] R. Weerasinghe, A. Wasala, V. Welgama, and K. Gamage, “Festival-si: A sinhala text-to-speech system,” in International Conference on Text, Speech and Dialogue. Springer, 2007, pp. 472–479.
  • [63] L. Nanayakkara, C. Liyanage, P.-T. Viswakula, T. Nadungodage, R. Pushpananda, and R. Weerasinghe, “A human quality text to speech system for sinhala,” in Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, pp. 157–161.
  • [64] T. Nadungodage, R. Weerasinghe, and M. Niranjan, “Speech recognition for low resourced languages: Efficient use of training data for sinhala speech recognition by active learning.”
  • Nadungodage and Weerasinghe [2011] T. Nadungodage and R. Weerasinghe, “Continuous sinhala speech recognizer,” in Conference on Human Language Technology for Development, Alexandria, Egypt, 2011, pp. 2–5.
  • Nadungodage et al. [2013] T. Nadungodage, R. Weerasinghe, and M. Niranjan, “Efficient use of training data for sinhala speech recognition using active learning,” in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on. IEEE, 2013, pp. 149–153.
  • Nadungodage et al. [2015] ——, “Speaker adaptation applied to sinhala speech recognition.” Int. J. Comput. Linguistics Appl., vol. 6, no. 1, pp. 117–129, 2015.
  • Dias et al. [2013a] G. Dias, T. Patikirikorala, C. Arambewela, R. Darshana, and N. Alahendra, “Sinhala optical character recognition for desktops,” 2013.
  • Dias et al. [2013b] G. Dias, T. Patikirikorala, C. Arambewela, R. Darshani, and N. Alahendra, “Online sinhala handwritten character recognition for desktops,” 2013.
  • Ranmuthugala et al. [2006] M. Ranmuthugala, G. Pathiragoda, S. Jayasundara, G. Dias, and A. Karunananda, “Online sinhala handwritten character recognition on handheld devices,” Innovations for a Knowledge Economy, p. 1, 2006.
  • Weerasinghe et al. [2008] R. Weerasinghe, A. Wasala, D. Herath, and V. Welgama, “Nlp applications of sinhala: Tts & ocr,” in Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II, 2008.
  • Weerasinghe et al. [2006] A. Weerasinghe, D. Herath, and N. Medagoda, “A nearest-neighbor based algorithm for printed sinhala character recognition,” Innovations for a Knowledge Economy, p. 11, 2006.
  • Rajapakse et al. [1995] R. K. Rajapakse, A. R. Weerasinghe, and E. K. Seneviratne, “A neural network based character recognition system for sinhala script,” Department of Statistics and Computer Science, University of Colombo, 1995.
  • Hettige and Karunananda [2007b] B. Hettige and A. S. Karunananda, “Transliteration system for english to sinhala machine translation,” in Industrial and Information Systems, 2007. ICIIS 2007. International Conference on. IEEE, 2007, pp. 209–214.
  • Hettige and Asoka [2010] B. Hettige and S. K. Asoka, “An evaluation methodology for english to sinhala machine translation,” in Information and Automation for Sustainability (ICIAFs), 2010 5th International Conference on. IEEE, 2010, pp. 31–36.
  • Hettige et al. [2016] B. Hettige, A. Karunananda, and G. Rzevski, “A multi-agent solution for managing complexity in english to sinhala machine translation,” Complex Systems: Fundamentals & Applications, vol. 90, p. 251, 2016.
  • Liyanapathirana and Weerasinghe [2011] J. Liyanapathirana and R. Weerasinghe, “English to sinhala machine translation: Towards better information access for sri lankans,” in Conference on Human Language Technology for Development, 2011, pp. 182–186.
  • Liyanapathirana [2013] J. Liyanapathirana, “A statistical approach to english and sinhala translation,” 2013.
  • Mohamed et al. [2017] M. Z. Mohamed, A. Ihalapathirana, R. A. Hameed, N. Pathirennehelage, S. Ranathunga, S. Jayasena, and G. Dias, “Automatic creation of a word aligned sinhala-tamil parallel corpus,” in Engineering Research Conference (MERCon), 2017 Moratuwa. IEEE, 2017, pp. 425–430.
  • Farhath et al. [2018a] F. Farhath, P. Theivendiram, S. Ranathunga, S. Jayasena, and G. Dias, “Improving domain-specific smt for low-resourced languages using data from different domains,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
  • [81] Department of Official Languages, Sri Lanka. Tri-lingual dictionary. [Online]. Available: https://www.trilingualdictionary.lk/
  • Weerasinghe and Dias [2013] A. Weerasinghe and G. Dias, “Construction of a multilingual place name database for sri lanka,” 2013.
  • Tennage et al. [2017] P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath, S. Ranathunga, S. Jayasena, and G. Dias, “Neural machine translation for sinhala and tamil languages,” in Asian Language Processing (IALP), 2017 International Conference on. IEEE, 2017, pp. 189–192.
  • Tennage et al. [2018a] P. Tennage, A. Herath, M. Thilakarathne, P. Sandaruwan, and S. Ranathunga, “Transliteration and byte pair encoding to improve tamil to sinhala neural machine translation,” in 2018 Moratuwa Engineering Research Conference (MERCon). IEEE, 2018, pp. 390–395.
  • Tennage et al. [2018b] P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath, and S. Ranathunga, “Handling rare word problem using synthetic training data for sinhala and tamil neural machine translation,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
  • Ranathunga et al. [2018] S. Ranathunga, F. Farhath, U. Thayasivam, S. Jayasena, and G. Dias, “Si-ta: Machine translation of sinhala and tamil official documents,” in 2018 National Information Technology Conference (NITC). IEEE, 2018, pp. 1–6.
  • Farhath et al. [2018b] F. Farhath, S. Ranathunga, S. Jayasena, and G. Dias, “Integration of bilingual lists for domain-specific statistical machine translation for sinhala-tamil,” in 2018 Moratuwa Engineering Research Conference (MERCon). IEEE, 2018, pp. 538–543.
  • Weerasinghe [2003] R. Weerasinghe, “A statistical machine translation approach to sinhala-tamil language translation,” Towards an ICT enabled Society, p. 136, 2003.
  • Sripirakas et al. [2010] S. Sripirakas, A. Weerasinghe, and D. L. Herath, “Statistical machine translation of systems for sinhala-tamil,” in Advances in ICT for Emerging Regions (ICTer), 2010 International Conference on. IEEE, 2010, pp. 62–68.
  • Jeyakaran [2013] M. Jeyakaran, “A novel kernel regression based machine translation system for sinhala-tamil translation,” 2013.
  • Pushpananda et al. [2013] R. Pushpananda, R. Weerasinghe, and M. Niranjan, “Towards sinhala tamil machine translation,” in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on. IEEE, 2013, pp. 288–288.
  • Pushpananda et al. [2014] ——, “Sinhala-tamil machine translation: Towards better translation quality,” in Proceedings of the Australasian Language Technology Association Workshop 2014, 2014, pp. 129–133.