Sequence to sequence pretraining for a less-resourced Slovenian language

Abstract

Large pretrained language models have recently conquered the area of natural language processing. As an alternative to the predominant masked language modelling introduced in BERT, the T5 model has introduced a more general training objective, namely sequence to sequence transformation, which includes masked language modelling but more naturally fits text generation tasks such as machine translation, summarization, open-domain question answering, text simplification, and dialogue systems. The monolingual variants of the T5 model have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages. In contrast, we trained two differently sized T5-type sequence to sequence models for the morphologically rich Slovene language with far fewer resources and analyzed their behavior. On classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model, but they should be considered for generative tasks.
