Abstract
Deep language models have achieved remarkable success in the NLP domain. The standard way to train a deep language model is to employ unsupervised learning from scratch on a large unlabeled corpus. However, such large corpora are only available for widely-adopted and high-resource languages and domains. This study presents the first deep language model, DPRK-BERT, for the DPRK language. We achieve this by compiling the first unlabeled corpus for the DPRK language and fine-tuning a preexisting ROK language model. We compare the proposed model with existing approaches and show significant improvements on two DPRK datasets. We also present a cross-lingual version of this model which yields better generalization across the two Korean languages. Finally, we provide various NLP tools related to the DPRK language that would foster future research.