RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use

Abstract

Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base models for fine-tuning on a particular task. Since the pre-training step is usually not repeated, base models are not up to date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus, e.g. corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate whether our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens. We found that for certain language tasks this update results in a significant performance increase. These results highlight the benefit of continually updating a language model to account for evolving language use.
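
The tokenizer update described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a simplified word-level view in which candidate tokens are lowercase, whitespace-separated words that occur frequently in a newer corpus but are missing from the old vocabulary. The function name `candidate_new_tokens`, the thresholds, and the toy data are hypothetical and chosen only for illustration; a real subword tokenizer would operate on byte-pair merges rather than whole words.

```python
from collections import Counter


def candidate_new_tokens(corpus_lines, existing_vocab, min_count=2, top_k=10):
    """Return frequent words from a newer corpus that the old vocabulary lacks.

    Assumption: whitespace splitting stands in for real subword tokenization.
    """
    # Count every lowercase, whitespace-separated word in the newer corpus.
    counts = Counter(
        word.lower()
        for line in corpus_lines
        for word in line.split()
    )
    # Keep words that are frequent enough and absent from the old vocabulary,
    # ordered from most to least frequent.
    novel = [
        word
        for word, count in counts.most_common()
        if count >= min_count and word not in existing_vocab
    ]
    return novel[:top_k]


# Toy data (hypothetical): an "old" vocabulary and a few newer sentences.
old_vocab = {"de", "het", "een", "is", "nieuw", "virus", "maatregelen"}
new_corpus = [
    "het coronavirus is een nieuw virus",
    "de coronamaatregelen",
    "coronavirus maatregelen",
    "de coronamaatregelen",
]

new_tokens = candidate_new_tokens(new_corpus, old_vocab)
print(new_tokens)
```

In this toy run, only the corona-related words pass both filters. Words already in the old vocabulary are skipped, and rare words fall below the `min_count` threshold. After such candidates are added to a tokenizer, the model's embedding matrix would also need new rows for them before further pre-training.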
