Feature Selection on Noisy Twitter Short Text Messages for Language Identification

Abstract

Written language identification typically involves detecting the languages present in a sample of text. A sequence of text may not belong to a single language; it may instead be a mixture of text written in multiple languages. Text of this kind is generated in large volumes on social media platforms because of their flexible and user-friendly environment. Such text contains a very large number of features, which are essential for developing statistical, probabilistic, and other kinds of language models. This large feature set includes informative features as well as irrelevant and redundant ones, and these have varying effects on the performance of the learning model. Feature selection methods are therefore important for choosing the features that are most relevant to an efficient model. In this article, we consider the Hindi-English language identification task, as Hindi and English are two of the most widely spoken languages in India. We apply different feature selection algorithms across various learning algorithms in order to analyze the effect of the algorithm, as well as the number of features, on the performance of the task. The methodology focuses on word-level language identification using a novel dataset of 6903 tweets extracted from Twitter. Various n-gram profiles are examined with different feature selection algorithms over many classifiers. Finally, an exhaustive comparative analysis is presented for the overall set of experiments conducted for the task.
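To make the idea of n-gram profiles combined with feature selection concrete, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the abstract does not name the feature selection algorithms or classifiers used, so the chi-square criterion, the boundary marker `_`, the function names, and the eight toy words with their `hi`/`en` labels are all assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Return the set of character n-grams of a word, padded with '_' boundary markers."""
    padded = f"_{word.lower()}_"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def chi_square_scores(samples, labels, positive):
    """Score each binary n-gram feature with the 2x2 chi-square statistic
    against membership in the `positive` class."""
    total = len(samples)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = total - n_pos
    present_all = Counter()  # number of words containing the feature
    present_pos = Counter()  # number of positive-class words containing it
    for feats, y in zip(samples, labels):
        for f in feats:
            present_all[f] += 1
            if y == positive:
                present_pos[f] += 1
    scores = {}
    for f, n_f in present_all.items():
        a = present_pos[f]  # feature present, positive class
        b = n_f - a         # feature present, other class
        c = n_pos - a       # feature absent, positive class
        d = n_neg - b       # feature absent, other class
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[f] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring features; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f for f, _ in ranked[:k]]


# Hypothetical toy data: romanized Hindi words vs. English words (not from the paper).
words = ["kya", "hai", "nahi", "mera", "the", "this", "with", "that"]
labels = ["hi", "hi", "hi", "hi", "en", "en", "en", "en"]

samples = [char_ngrams(w) for w in words]
scores = chi_square_scores(samples, labels, positive="en")
top = select_top_k(scores, 2)
```

In a fuller experiment, the selected n-grams would define the feature vector passed to a classifier, and the number of retained features `k` would be varied to observe its effect on word-level accuracy.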
