Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages

Abstract

Morphological segmentation for polysynthetic languages is challenging because a word may consist of many individual morphemes and training data can be extremely scarce. Since neural sequence-to-sequence (seq2seq) models define the state of the art for morphological segmentation in high-resource settings and for (mostly) European languages, we first show that they also obtain competitive performance for Mexican polysynthetic languages in minimal-resource settings. We then propose two novel multi-task training approaches (one with, one without need for external unlabeled resources) and two corresponding data augmentation methods, improving over the neural baseline for all languages. Finally, we explore cross-lingual transfer as a third way to fortify our neural model and show that we can train one single multi-lingual model for related languages while maintaining comparable or even improved performance, thus reducing the number of parameters by close to 75%. We provide our morphological segmentation datasets for Mexicanero, Nahuatl, Wixarika and Yorem Nokki for future research.
