From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection

Abstract

Natural language processing is a fast-growing field of artificial intelligence. Since Google introduced the Transformer in 2017, a large number of language models such as BERT, GPT, and ELMo have been inspired by this architecture. These models were trained on huge datasets and achieved state-of-the-art results on natural language understanding. However, fine-tuning a pre-trained language model on much smaller datasets for downstream tasks requires a carefully designed pipeline to mitigate dataset problems such as scarce training data and class imbalance. In this paper, we propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection. We first tune PhoBERT on our dataset by re-training the model on the Masked Language Model task; then, we employ its encoder for text classification. To preserve the pre-trained weights while learning new feature representations, we further apply three training techniques: layer freezing, block-wise learning rates, and label smoothing. Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state-of-the-art result on the Vietnamese Hate Speech Detection campaign with an F1 score of 0.7221.
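As a rough illustration of the three training techniques named above, the sketch below computes smoothed targets and a per-block learning-rate schedule in plain Python. It is not the authors' implementation: the abstract gives no hyperparameters, so the smoothing factor, the number of encoder blocks, the decay factor, the number of frozen blocks, and the three-class setup are all assumptions, and the function names `smooth_labels` and `blockwise_learning_rates` are hypothetical. Block-wise learning rates are read here as a geometric decay from the top block down, which is one common interpretation.

```python
from typing import Dict, List


def smooth_labels(label: int, num_classes: int, epsilon: float = 0.1) -> List[float]:
    """Uniform label smoothing for a single example.

    The true class receives 1 - epsilon + epsilon / K and every other
    class receives epsilon / K, so the target still sums to 1.
    epsilon = 0.1 is an assumed value, not taken from the paper.
    """
    off_value = epsilon / num_classes
    return [
        1.0 - epsilon + off_value if k == label else off_value
        for k in range(num_classes)
    ]


def blockwise_learning_rates(
    num_blocks: int,
    base_lr: float,
    decay: float,
    frozen_blocks: int = 0,
) -> Dict[int, float]:
    """Assign one learning rate per encoder block.

    Block 0 is closest to the input. The top block trains at base_lr and
    each block below it is scaled by a further factor of decay, so lower
    blocks move less and keep more of their pre-trained weights.
    The first frozen_blocks blocks get a rate of 0.0 (layer freezing).
    """
    rates: Dict[int, float] = {}
    for block in range(num_blocks):
        if block < frozen_blocks:
            rates[block] = 0.0  # frozen: no update at all
        else:
            rates[block] = base_lr * decay ** (num_blocks - 1 - block)
    return rates


if __name__ == "__main__":
    # Assumed setup: 3 classes, 12 encoder blocks, 2 frozen blocks.
    print(smooth_labels(label=1, num_classes=3))
    for block, lr in blockwise_learning_rates(12, 2e-5, 0.9, frozen_blocks=2).items():
        print(block, lr)
```

In a real training loop these per-block rates would typically be passed to the optimizer as separate parameter groups, and frozen blocks would have gradient computation disabled rather than a zero learning rate.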
