Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks

Abstract

In this paper we present a novel lemmatization method based on asequence-to-sequence neural network architecture and morphosyntactic contextrepresentation. In the proposed method, our context-sensitive lemmatizergenerates the lemma one character at a time based on the surface formcharacters and its morphosyntactic features obtained from a morphologicaltagger. We argue that a sliding window context representation suffers fromsparseness, while in majority of cases the morphosyntactic features of a wordbring enough information to resolve lemma ambiguities while keeping the contextrepresentation dense and more practical for machine learning systems.Additionally, we study two different data augmentation methods utilizingautoencoder training and morphological transducers especially beneficial forlow resource languages. We evaluate our lemmatizer on 52 different languagesand 76 different treebanks, showing that our system outperforms all latestbaseline systems. Compared to the best overall baseline, UDPipe Future, oursystem outperforms it on 60 out of 76 treebanks reducing errors on average by18% relative. The lemmatizer together with all trained models is made availableas a part of the Turku-neural-parsing-pipeline under the Apache 2.0 license.

Quick Read (beta)

loading the full paper ...