Development of a Dataset and a Deep Learning Baseline Named Entity Recognizer for Three Low Resource Languages: Bhojpuri, Maithili and Magahi

Abstract

In Natural Language Processing (NLP) pipelines, Named Entity Recognition (NER) is one of the preliminary problems: it marks proper nouns and other named entities such as Location, Person, Organization, Disease, etc. Left unhandled, such entities adversely affect the performance of a machine translation system. NER helps overcome this problem by recognising and handling such entities separately, and it is also useful in Information Extraction systems. Bhojpuri, Maithili and Magahi are low resource languages, usually known as Purvanchal languages. This paper focuses on the development of a NER benchmark dataset for the Machine Translation systems developed to translate from these languages to Hindi, created by annotating parts of their available corpora. Bhojpuri, Maithili and Magahi corpora of sizes 228373, 157468 and 56190 tokens, respectively, were annotated using 22 entity labels. The annotation uses coarse-grained labels that follow the tagset used in one of the Hindi NER datasets. We also report a Deep Learning based baseline that uses an LSTM-CNNs-CRF model. The lower baseline F1-scores from the NER tool obtained by using Conditional Random Fields models are 96.73 for Bhojpuri, 93.33 for Maithili and 95.04 for Magahi. The Deep Learning-based technique (LSTM-CNNs-CRF) achieved 96.25 for Bhojpuri, 93.33 for Maithili and 95.44 for Magahi.
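
For readers unfamiliar with how F1-scores of this kind are typically computed for NER, the following is a minimal sketch of span-level precision, recall and F1. It rests on assumptions that the abstract does not confirm: that the tags follow a BIO scheme and that a predicted entity counts as correct only when its boundaries and label both match the gold annotation exactly. The function names `extract_spans` and `span_f1` and the example tags are hypothetical, chosen only for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (BIO is an assumption)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
                start, label = None, None
            # A stray "I-" tag is treated leniently as the start of a span.
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged exact-match F1 over all sentences, in the range 0..1."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = extract_spans(g), extract_spans(p)
        tp += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of the two gold entities is recovered.
gold = [["B-PERSON", "I-PERSON", "O", "B-LOCATION"]]
pred = [["B-PERSON", "I-PERSON", "O", "O"]]
score = span_f1(gold, pred)
```

In this example precision is 1.0 and recall is 0.5, so `score` is 2/3. Scores reported on a 0 to 100 scale, as in the abstract, correspond to this value multiplied by 100; whether the reported figures are span-level or token-level is not stated here.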
