Abstract
Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically-Rich Languages (MRLs) pose a challenge to this basic formulation, as the boundaries of Named Entities do not coincide with token boundaries; rather, they respect morphological boundaries. To address NER in MRLs we then need to answer two fundamental modeling questions: (i) What should be the basic units to be identified and labeled: are they token-based or morpheme-based? and (ii) How can morphological units be encoded and accurately obtained in realistic (non-gold) scenarios? We empirically investigate these questions on a novel parallel NER benchmark we deliver, with parallel token-level and morpheme-level NER annotations for Modern Hebrew, a morphologically complex language. Our results show that explicitly modeling morphological boundaries consistently leads to improved NER performance, and that a novel hybrid architecture that we propose, in which NER precedes and prunes the morphological decomposition (MD) space, greatly outperforms the standard pipeline approach, on both Hebrew NER and Hebrew MD in realistic scenarios.