Automatic Language Identification for Romance Languages using Stop Words and Diacritics

Abstract

Automatic language identification is a natural language processing problem that tries to determine the natural language of a given content. In this paper we present a statistical method for automatic language identification of written text using dictionaries containing stop words and diacritics. We propose different approaches that combine the two dictionaries to accurately determine the language of textual corpora. This method was chosen because stop words and diacritics are highly language-specific: although some languages share similar words and special characters, they do not share all of them. We focus on Romance languages because they are very similar to one another and are usually hard to distinguish computationally. We tested our method on a Twitter corpus and a news article corpus. Both corpora consist of UTF-8 encoded text, so diacritics can be taken into account; when a text contains no diacritics, only the stop words are used to determine its language. The experimental results show that the proposed method has an accuracy of over 90% for small texts and over 99.8% for
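
The dictionary-based idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the tiny word and character lists, the additive scoring, the weights `w_stop` and `w_diac`, and the function name `identify_language` are hypothetical and are not the paper's actual dictionaries or combination schemes.

```python
import re
import unicodedata

# Tiny illustrative dictionaries (assumed for this sketch; real lists
# would be much larger and are not taken from the paper).
STOP_WORDS = {
    "es": {"el", "los", "y", "que", "con", "para", "una", "del"},
    "pt": {"o", "os", "e", "que", "com", "para", "uma", "do", "não"},
    "fr": {"le", "les", "et", "que", "avec", "pour", "une", "du", "est"},
    "it": {"il", "gli", "e", "che", "con", "per", "una", "del", "è"},
    "ro": {"și", "că", "cu", "pentru", "este", "din", "la", "un"},
}
DIACRITICS = {
    "es": set("áéíóúñü"),
    "pt": set("áâãàçéêíóôõú"),
    "fr": set("àâçéèêëîïôùûüœ"),
    "it": set("àèéìòù"),
    "ro": set("ăâîșț"),
}


def identify_language(text, w_stop=1.0, w_diac=1.0):
    """Return the best-scoring language code, or None if nothing matches.

    The score is a weighted sum of stop-word hits and diacritic hits
    (an assumed combination rule). If the text has no diacritics, the
    diacritic term is zero for every language, so only the stop words
    decide the result.
    """
    # Normalize to NFC so accented letters are single code points.
    lowered = unicodedata.normalize("NFC", text.lower())
    tokens = re.findall(r"\w+", lowered)
    scores = {}
    for lang in STOP_WORDS:
        stop_hits = sum(1 for tok in tokens if tok in STOP_WORDS[lang])
        diac_hits = sum(1 for ch in lowered if ch in DIACRITICS[lang])
        scores[lang] = w_stop * stop_hits + w_diac * diac_hits
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(identify_language("El perro y el gato juegan con una pelota"))
    print(identify_language("Ele não sabe que é tarde"))
```

Setting `w_diac=0` reduces the sketch to a stop-words-only classifier, and `w_stop=0` to a diacritics-only one, which is one simple way to compare the two dictionaries separately and in combination.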
