CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts

Abstract

The task of automatically identifying the language used in a given text is called Language Identification (LI). India is a multilingual country, and many Indians, especially young people, are comfortable with Hindi and English in addition to their local languages. Hence, they often use more than one language when posting comments on social media. Texts containing more than one language are called "code-mixed texts" and are a good source of input for LI. Languages in these texts may be mixed at the sentence level, the word level, or even the sub-word level. LI at the word level is a sequence labeling problem in which every word in a sentence is tagged with one of the languages in a predefined set. To address word-level LI in code-mixed Kannada-English (Kn-En) texts, this work presents i) the construction of a code-mixed Kn-En dataset called the CoLI-Kenglish dataset, ii) a code-mixed Kn-En embedding, and iii) learning models using Machine Learning (ML), Deep Learning (DL), and Transfer Learning (TL) approaches. Code-mixed Kn-En texts are extracted from Kannada YouTube video comments to construct the CoLI-Kenglish dataset and the code-mixed Kn-En embedding. The words in the CoLI-Kenglish dataset are grouped into six major categories, namely, "Kannada", "English", "Mixed-language", "Name", "Location", and "Other". The learning models, namely, CoLI-vectors and CoLI-ngrams based on ML, CoLI-BiLSTM based on DL, and CoLI-ULMFiT based on TL, are built and evaluated using the CoLI-Kenglish dataset. The results show that the CoLI-ngrams model outperforms the other models, with a macro-averaged F1-score of 0.64, although the results of all the learning models were quite competitive with each other.
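
To make the word-level setup and the reported metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes that n-gram features are character n-grams with word-boundary markers (the abstract does not specify this), and the helper names `char_ngrams` and `macro_f1`, as well as the toy labels in the usage lines, are hypothetical. The macro-averaged F1-score is the unweighted mean of the per-label F1-scores, so rare categories such as "Location" count as much as frequent ones.

```python
def char_ngrams(word, n_min=1, n_max=3):
    """Return character n-grams of a word, with '<' and '>' as boundary markers.

    Assumption: character-level n-grams; the abstract only says "n-grams".
    """
    padded = f"<{word.lower()}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy usage with made-up tags for four words (not data from the paper).
gold = ["Kannada", "English", "Kannada", "Name"]
pred = ["Kannada", "English", "English", "Name"]
print(char_ngrams("ok", 2, 2))         # ['<o', 'ok', 'k>']
print(round(macro_f1(gold, pred), 4))  # 0.7778
```

In this toy case one "Kannada" word is mislabeled as "English", which lowers the F1-score of both labels to 2/3 while "Name" stays at 1, giving a macro average of 7/9.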
