Word Level Language Identification in English Telugu Code Mixed Data

Abstract

In a multilingual or sociolingual setting, Intra-sentential Code Switching (ICS), or Code Mixing (CM), is frequently observed nowadays. Most people in the world know more than one language. CM usage is especially apparent on social media platforms. Moreover, ICS is particularly significant in the context of technology, health, and law, where conveying upcoming developments is difficult in one's native language. In applications such as dialog systems, machine translation, semantic parsing, and shallow parsing, CM and Code Switching pose serious challenges. Language Identification is a necessary first step for any further progress on code-mixed data. In this paper, we present a study of various models for Language Identification in English-Telugu code-mixed data: Naïve Bayes Classifier, Random Forest Classifier, Conditional Random Field (CRF), and Hidden Markov Model (HMM). Considering the paucity of resources in code-mixed languages, we propose the CRF and HMM models for word-level language identification. Our best-performing system is CRF-based, with an F1-score of 0.91.
