Canonical and Surface Morphological Segmentation for Nguni Languages

Abstract

Morphological segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically rich agglutinative languages such as the Southern African Nguni language group. In this paper, we investigate supervised and unsupervised models for two variants of morphological segmentation: canonical and surface segmentation. We train sequence-to-sequence models for canonical segmentation, where the underlying morphemes may not be equal to the surface form of the word, and Conditional Random Fields (CRFs) for surface segmentation. Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across four languages. Feature-based CRFs outperform bidirectional LSTM-CRFs, obtaining an average F1 score of 97.1% on surface segmentation. In the unsupervised setting, an entropy-based approach using a character-level LSTM language model fails to outperform a Morfessor baseline, while on some of the languages neither approach performs much better than a random baseline. We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
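To make the distinction between the two task variants concrete, the following minimal Python sketch contrasts a surface segmentation, whose morphemes concatenate back to the word, with a canonical segmentation, whose morphemes may not. The English word "funniest" is a standard textbook illustration and is not drawn from the Nguni data; the function names and the multiset-overlap definition of morpheme-level F1 are assumptions made for illustration and may differ from the metric used in the paper.

```python
from collections import Counter


def is_surface_segmentation(word, morphemes):
    """A surface segmentation only inserts boundaries, so its
    morphemes must concatenate back to the original word."""
    return "".join(morphemes) == word


def morpheme_f1(predicted, gold):
    """Morpheme-level F1 from multiset overlap between predicted and
    gold morphemes (an assumed definition, for illustration only)."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


word = "funniest"
surface = ["funn", "i", "est"]   # substrings of the word itself
canonical = ["fun", "y", "est"]  # underlying forms; spelling changes undone

print(is_surface_segmentation(word, surface))    # True
print(is_surface_segmentation(word, canonical))  # False
```

Because the canonical output is not constrained to be a substring sequence of the input, it is naturally framed as string transduction (sequence-to-sequence), whereas surface segmentation can be framed as labelling each character with a boundary tag, which suits a CRF.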
