DPRK-BERT: The Supreme Language Model

Abstract

Deep language models have achieved remarkable success in the NLP domain. The standard way to train a deep language model is to employ unsupervised learning from scratch on a large unlabeled corpus. However, such large corpora are only available for widely adopted, high-resource languages and domains. This study presents the first deep language model, DPRK-BERT, for the DPRK language. We achieve this by compiling the first unlabeled corpus for the DPRK language and fine-tuning a preexisting ROK language model. We compare the proposed model with existing approaches and show significant improvements on two DPRK datasets. We also present a cross-lingual version of this model, which yields better generalization across the two Korean languages. Finally, we provide various NLP tools related to the DPRK language to foster future research.
