RobBERT: a Dutch RoBERTa-based Language Model

Abstract

Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT, which was released in both an English and a multilingual version. Although multilingual BERT performs well on many tasks, recent studies show that BERT models trained on a single language significantly outperform the multilingual version. Training a Dutch BERT model thus has a lot of potential for a wide range of Dutch NLP tasks. While previous approaches have used earlier implementations of BERT to train a Dutch version of BERT, we used RoBERTa, a robustly optimized BERT approach, to train a Dutch language model called RobBERT. We measured its performance on various tasks as well as the importance of the fine-tuning dataset size. We also evaluated the importance of language-specific tokenizers and the model's fairness. We found that RobBERT improves state-of-the-art results for various tasks, and that it outperforms other models most markedly when dealing with smaller datasets. These results indicate that it is a powerful pre-trained model for a large variety of Dutch language tasks. The pre-trained and fine-tuned models are publicly available to support further downstream Dutch NLP applications.
