Cross Script Hindi English NER Corpus from Wikipedia

Abstract

The text generated on social media platforms is essentially mixed-lingual text. Mixing of languages in any form creates considerable difficulty for language processing systems. Moreover, advances in language processing research depend upon the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems faces obstacles due to the unavailability of standard evaluation corpora. Such corpora may be of a mixed-lingual nature, in which text is written in multiple languages while predominantly using a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a Cross Script Hindi-English Corpus from Wikipedia category pages. The corpus is successfully annotated using the standard CoNLL-2003 categories of PER, LOC, ORG, and MISC. It is evaluated on a variety of machine learning algorithms, and favorable results are achieved.
