From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection

  • 2021-02-24 09:30:55
  • Quang Huu Pham, Viet Anh Nguyen, Linh Bao Doan, Ngoc N. Tran, Ta Minh Thanh

Abstract

Natural language processing is a fast-growing field of artificial intelligence. Since the Transformer was introduced by Google in 2017, a large number of language models such as BERT, GPT, and ELMo have been inspired by this architecture. These models were trained on huge datasets and achieved state-of-the-art results on natural language understanding. However, fine-tuning a pre-trained language model on much smaller datasets for downstream tasks requires a carefully designed pipeline to mitigate problems of the datasets such as lack of training data and imbalanced data. In this paper, we propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection. We first tune PhoBERT on our dataset by re-training the model on the Masked Language Model task; then, we employ its encoder for text classification. In order to preserve pre-trained weights while learning new feature representations, we further utilize different training techniques: layer freezing, block-wise learning rate, and label smoothing. Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state-of-the-art result on the Vietnamese Hate Speech Detection campaign with an F1 score of 0.7221.
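The three training techniques named in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names, the three-class setup, the smoothing factor `eps`, the geometric `decay` rule for block-wise learning rates, and the choice to represent frozen blocks as a learning rate of 0.0 are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def smooth_labels(target, num_classes, eps=0.1):
    """Label smoothing (assumed uniform variant): keep 1 - eps on the true
    class and spread eps evenly over all classes."""
    base = eps / num_classes
    return [base + (1.0 - eps if i == target else 0.0) for i in range(num_classes)]

def cross_entropy(logits, target_dist):
    """Cross-entropy between softmax(logits) and a (possibly smoothed) target."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(p * (x - log_z) for p, x in zip(target_dist, logits))

def blockwise_learning_rates(num_blocks, top_lr=2e-5, decay=0.9, num_frozen=0):
    """One learning rate per encoder block; block 0 is closest to the input.

    Assumed rule: the top block uses top_lr and each block below it is
    scaled by `decay`, so lower (more general) blocks change more slowly.
    The lowest `num_frozen` blocks get 0.0, which models layer freezing.
    """
    rates = []
    for i in range(num_blocks):
        if i < num_frozen:
            rates.append(0.0)  # frozen block: weights are not updated
        else:
            rates.append(top_lr * decay ** (num_blocks - 1 - i))
    return rates

# Hypothetical usage: 3 classes, 12 encoder blocks, lowest 2 blocks frozen.
target = smooth_labels(target=2, num_classes=3, eps=0.1)
loss = cross_entropy([0.2, -1.0, 1.5], target)
rates = blockwise_learning_rates(num_blocks=12, num_frozen=2)
```

In a real fine-tuning setup, the per-block rates would typically be passed to the optimizer as separate parameter groups, and frozen blocks would simply have gradient computation disabled rather than a zero learning rate.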

 
