Unsupervised Separation of Native and Loanwords for Malayalam and Telugu

Abstract

Quite often, words from one language are adopted within a different language without translation; these words appear in transliterated form in text written in the latter language. This phenomenon is particularly widespread within Indian languages, where many words are loaned from English. In this paper, we address the task of identifying loanwords automatically and in an unsupervised manner, from large datasets of words from agglutinative Dravidian languages. We target two specific languages from the Dravidian family, viz., Malayalam and Telugu. Based on familiarity with the languages, we outline an observation that native words in both these languages tend to be characterized by a much more versatile stem (stem being a shorthand for the subword sequence formed by the first few characters of the word) than words that are loaned from other languages. We harness this observation to build an objective function and an iterative formulation to optimize it, yielding a scoring of each word's nativeness in the process. Through an extensive empirical analysis over real-world datasets from both Malayalam and Telugu, we illustrate that our method quantifies nativeness more effectively than the available baselines for the task.
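
The stem-versatility observation can be illustrated with a small sketch. It is not the objective function or the iterative optimization that the abstract refers to; it only assumes that a stem is the first `stem_len` characters of a word and that versatility can be approximated by the number of distinct continuations seen after that stem. The function names, the `stem_len` parameter, and the toy romanized word list are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def stem_versatility(words, stem_len=3):
    """Count the distinct suffixes observed after each stem.

    Assumption: a stem is simply the first `stem_len` characters of a word.
    """
    suffixes = defaultdict(set)
    for word in words:
        if len(word) > stem_len:
            suffixes[word[:stem_len]].add(word[stem_len:])
    return {stem: len(found) for stem, found in suffixes.items()}


def word_scores(words, stem_len=3):
    """Give each word the versatility of its stem as a crude nativeness proxy."""
    versatility = stem_versatility(words, stem_len)
    # Words too short to have a suffix, or with an unseen stem, score 0.
    return {word: versatility.get(word[:stem_len], 0) for word in words}


# Toy romanized strings (illustrative only, not data from the paper).
toy_words = ["maram", "marangal", "marathil", "marathinte", "computer"]
scores = word_scores(toy_words)
for word in toy_words:
    print(word, scores[word])
```

In this toy list the stem `mar` is followed by four different suffixes while `com` is followed by one, so the words sharing `mar` receive the higher score. A raw count like this is sensitive to corpus size and to the choice of `stem_len`, which is one reason a learned scoring would be preferable to this heuristic.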
