Building Low-Resource NER Models Using Non-Speaker Annotation

Abstract

In low-resource natural language processing (NLP), the key problem is a lack of training data in the target language. Cross-lingual methods have had notable success in addressing this concern, but in certain common circumstances, such as insufficient pre-training corpora or languages far from the source language, their performance suffers. In this work we propose an alternative approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations, provided by annotators with no prior experience in the target language. We recruit 30 participants to annotate unfamiliar languages in a carefully controlled annotation experiment, using Indonesian, Russian, and Hindi as target languages. Our results show that the use of non-speaker annotators produces results that approach or match the performance of fluent speakers. NS results are also consistently on par with or better than cross-lingual methods built on modern contextual representations, and have the potential to outperform them further with additional effort. We conclude with observations of common annotation practices and recommendations for maximizing non-speaker annotator performance.
