Knowledge Based Multilingual Language Model

Abstract

Knowledge enriched language representation learning has shown promising performance across various knowledge-intensive NLP tasks. However, existing knowledge based language models are all trained with monolingual knowledge graph data, which limits their application to more languages. In this work, we present a novel framework to pretrain knowledge based multilingual language models (KMLMs). We first generate a large amount of code-switched synthetic sentences and reasoning-based multilingual training data using the Wikidata knowledge graphs. Then based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks to facilitate knowledge learning, which allows the language models to not only memorize the factual knowledge but also learn useful logical patterns. Our pretrained KMLMs demonstrate significant performance improvements on a wide range of knowledge-intensive cross-lingual NLP tasks, including named entity recognition, factual knowledge retrieval, relation classification, and a new task designed by us, namely, logic reasoning. Our code and pretrained language models will be made publicly available.
