Bootstrapping Method for Developing Part-of-Speech Tagged Corpus in Low Resource Languages Tagset - A Focus on an African Igbo

Abstract

Most languages, especially in Africa, have few or no established part-of-speech (POS) tagged corpora. However, a POS tagged corpus is essential for natural language processing (NLP) to support advanced research such as machine translation and speech recognition. Even in cases where there is no POS tagged corpus, there are some languages for which parallel texts are available online. The task of POS tagging a new language corpus with a new tagset usually faces a bootstrapping problem at the initial stages of the annotation process. The unavailability of automatic taggers to help the human annotator makes it appear infeasible to quickly produce adequate amounts of POS tagged corpus for advanced NLP research and for training the taggers. In this paper, we demonstrate the efficacy of a POS annotation method that employs two automatic approaches to assist POS tagged corpus creation for a language that is new to NLP. The two approaches are cross-lingual and monolingual POS tag projection. We used the cross-lingual approach to automatically create an initial 'errorful' tagged corpus for a target language via word alignment. The resources for creating this are derived from a source language rich in NLP resources. A monolingual method is applied to clean the induced noise via an alignment process and to transform the source language tags to the target language tags. We used English and Igbo as our case study. This is possible because parallel texts exist between English and Igbo, and the source language, English, has available NLP resources. The results of the experiment show a steady improvement in accuracy and in the rate of tag transformation, with scores ranging from 6.13% to 83.79% and from 8.67% to 98.37% respectively. The rate of tag transformation evaluates the rate at which source language tags are translated to target language tags.
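
To make the cross-lingual step more concrete, the following is a minimal Python sketch of projecting POS tags through word alignments, together with one possible reading of the rate of tag transformation. It is not the authors' implementation: the function names, the `UNK` placeholder, the majority-vote rule for multiply aligned tokens, and the target tag names (`N_TGT`, `V_TGT`) are all assumptions made for illustration, and the alignments are supplied by hand rather than produced by a word aligner.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_len, unknown="UNK"):
    """Copy source-side POS tags onto target tokens through word alignments.

    alignment is a list of (source_index, target_index) pairs.
    A target token aligned to several source tokens takes the most frequent
    source tag (assumed rule); an unaligned token gets the placeholder tag.
    """
    votes = [Counter() for _ in range(target_len)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][source_tags[src_idx]] += 1
    return [v.most_common(1)[0][0] if v else unknown for v in votes]


def transformation_rate(tags, target_tagset):
    """Fraction of tokens whose tag belongs to the target tagset.

    This is an assumed formalisation of the rate of tag transformation.
    """
    if not tags:
        return 0.0
    return sum(tag in target_tagset for tag in tags) / len(tags)


# Toy example: three source tokens, three target tokens, crossed alignment.
source_tags = ["DT", "NN", "VBD"]
alignment = [(0, 1), (1, 0), (2, 2)]
projected = project_tags(source_tags, alignment, target_len=3)

# Hypothetical state after a partial monolingual clean-up step, in which
# two of the three projected tags have been rewritten into target tags.
partially_transformed = ["N_TGT", "DT", "V_TGT"]
rate = transformation_rate(partially_transformed, {"N_TGT", "V_TGT"})
```

In this sketch the projected corpus still carries source language tags, which corresponds to the 'errorful' starting point; the rate rises as more tokens receive tags from the target tagset.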
