Training Multilingual Pre-trained Language Model with Byte-level Subwords

  • 2021-06-03 14:37:37
  • Junqiu Wei, Qun Liu, Yinpeng Guo, Xin Jiang

Abstract

Pre-trained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text by pre-training on large-scale corpora. One of the fundamental components of pre-trained language models is the vocabulary, especially when training multilingual models on many different languages. In this technical report, we present our practices for training multilingual pre-trained language models with BBPE: Byte-Level BPE (i.e., Byte Pair Encoding). In our experiments, we adopted the architecture of NEZHA as the underlying pre-trained language model, and the results show that NEZHA trained with byte-level subwords consistently outperforms Google multilingual BERT and vanilla NEZHA by a notable margin in several multilingual NLU tasks. We release the source code of our byte-level vocabulary building tools and the multilingual pre-trained language models.
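To make the idea of byte-level subwords concrete, the following is a minimal sketch of byte-level BPE in Python. It is not the authors' released tool: the function names (`text_to_byte_symbols`, `learn_bbpe_merges`, `apply_merge`, `encode`) and the toy corpus are hypothetical, and the sketch simplifies by allowing merges across whitespace, which practical implementations often restrict. It only illustrates the general technique: text is first converted to UTF-8 bytes, and the most frequent adjacent pair of symbols is merged repeatedly.

```python
from collections import Counter


def text_to_byte_symbols(text):
    """Encode text as UTF-8 and return each byte as a one-byte symbol."""
    return [bytes([b]) for b in text.encode("utf-8")]


def apply_merge(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with the joined symbol."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bbpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (illustrative only)."""
    sequences = [text_to_byte_symbols(line) for line in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the whole corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by byte order so the
        # result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(text, merges):
    """Tokenize text by applying the learned merges in the order they were learned."""
    seq = text_to_byte_symbols(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


if __name__ == "__main__":
    toy_corpus = ["low lower lowest", "你好 你好"]  # assumed toy data
    merges = learn_bbpe_merges(toy_corpus, num_merges=10)
    tokens = encode("你好 low", merges)
    print(tokens)
    # Concatenating the tokens always recovers the original UTF-8 bytes.
    print(b"".join(tokens).decode("utf-8"))
```

Because the base alphabet is the 256 possible byte values, any string in any language can be tokenized without an out-of-vocabulary symbol, which is one reason a byte-level vocabulary is attractive for multilingual models.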

 
