ViDeBERTa: A powerful pre-trained language model for Vietnamese

Abstract

This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese, with three versions (ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large) that are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts using the DeBERTa architecture. Although many successful Transformer-based pre-trained language models have been proposed for English, there are still few pre-trained models for Vietnamese, a low-resource language, that achieve good results on downstream tasks, especially question answering. We fine-tune and evaluate our models on three important natural language downstream tasks: part-of-speech tagging, named-entity recognition, and question answering. The empirical results demonstrate that ViDeBERTa, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese-specific natural language understanding tasks. Notably, ViDeBERTa_base, with 86M parameters (only about 23% of the 370M parameters of PhoBERT_large), still performs as well as or better than the previous state-of-the-art model. Our ViDeBERTa models are available at: https://github.com/HySonLab/ViDeBERTa.
