On the Strength of Character Language Models for Multilingual Named Entity Recognition

Abstract

Character-level patterns have been widely used as features in English Named Entity Recognition (NER) systems. However, to date there has been no direct investigation of the inherent differences between name and non-name tokens in text, nor whether this property holds across multiple languages. This paper analyzes the capabilities of corpus-agnostic Character-level Language Models (CLMs) in the binary task of distinguishing name tokens from non-name tokens. We demonstrate that CLMs provide a simple and powerful model for capturing these differences, identifying named entity tokens in a diverse set of languages at close to the performance of full NER systems. Moreover, by adding very simple CLM-based features we can significantly improve the performance of an off-the-shelf NER system for multiple languages.
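
To make the binary task concrete, the following is a minimal sketch of one way a CLM-based name detector could work: train one character language model on name tokens and another on non-name tokens, then assign a token to whichever model gives it the higher average per-character log-probability. The abstract does not specify the model order, smoothing method, or decision rule, so the bigram order, add-one smoothing, the class `CharBigramLM`, the function `is_name`, and the toy training lists are all assumptions made for illustration.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram language model with add-one smoothing (illustrative)."""

    def __init__(self, tokens):
        self.bigrams = Counter()   # counts of (previous char, current char)
        self.contexts = Counter()  # counts of previous char
        self.vocab = {"$"}         # "$" marks end of token, "^" marks start
        for tok in tokens:
            chars = ["^"] + list(tok) + ["$"]
            self.vocab.update(chars[1:])
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def avg_log_prob(self, token):
        """Average log-probability per predicted character of `token`."""
        chars = ["^"] + list(token) + ["$"]
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + v
            total += math.log(num / den)
        return total / (len(chars) - 1)


def is_name(token, name_lm, other_lm):
    """Label `token` a name if the name model scores it higher."""
    return name_lm.avg_log_prob(token) > other_lm.avg_log_prob(token)


# Toy training data (hypothetical; real training would use annotated corpora).
name_lm = CharBigramLM(["London", "Paris", "Obama", "Merkel", "Tokyo", "Berlin"])
other_lm = CharBigramLM(["the", "running", "quickly", "table", "and", "walked"])

for tok in ["London", "the"]:
    print(tok, is_name(tok, name_lm, other_lm))
```

Averaging over characters keeps scores comparable across tokens of different lengths, which is the same motivation behind comparing perplexities. With this toy data the models mostly pick up capitalization and a few letter sequences, so the sketch illustrates only the decision rule, not the reported performance.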
