Universal Word Segmentation: Implementation and Interpretation

  • 2018-07-09 07:51:51
  • Yan Shao, Christian Hardmeier, Joakim Nivre
  • 5

Abstract

Word segmentation is a low-level NLP task that is non-trivial for aconsiderable number of languages. In this paper, we present a sequence taggingframework and apply it to word segmentation for a wide range of languages withdifferent writing systems and typological characteristics. Additionally, weinvestigate the correlations between various typological factors and wordsegmentation accuracy. The experimental results indicate that segmentationaccuracy is positively related to word boundary markers and negatively to thenumber of unique non-segmental terms. Based on the analysis, we design a smallset of language-specific settings and extensively evaluate the segmentationsystem on the Universal Dependencies datasets. Our model obtainsstate-of-the-art accuracies on all the UD languages. It performs substantiallybetter on languages that are non-trivial to segment, such as Chinese, Japanese,Arabic and Hebrew, when compared to previous work.

 

Quick Read (beta)

loading the full paper ...