Linguistic Entity Masking to Improve Cross-Lingual Representation of Multilingual Language Models for Low-Resource Languages

Abstract

Multilingual Pre-trained Language Models (multiPLMs), trained on the Masked Language Modelling (MLM) objective, are commonly used for cross-lingual tasks such as bitext mining. However, the performance of these models is still suboptimal for low-resource languages (LRLs). To improve the language representation of a given multiPLM, it is possible to further pre-train it. This is known as continual pre-training. Previous research has shown that continual pre-training with MLM and subsequently with Translation Language Modelling (TLM) improves the cross-lingual representation of multiPLMs. However, during masking, both MLM and TLM give equal weight to all tokens in the input sequence, irrespective of the linguistic properties of the tokens. In this paper, we introduce a novel masking strategy, Linguistic Entity Masking (LEM), to be used in the continual pre-training step to further improve the cross-lingual representations of existing multiPLMs. LEM differs from MLM and TLM in two ways. First, LEM limits masking to the linguistic entity types nouns, verbs and named entities, which hold a higher prominence in a sentence. Second, LEM masks only a single token within each linguistic entity span, thus keeping more context, whereas MLM and TLM mask tokens randomly. We evaluate the effectiveness of LEM on three downstream tasks, namely bitext mining, parallel data curation and code-mixed sentiment analysis, using three low-resource language pairs: English-Sinhala, English-Tamil, and Sinhala-Tamil. Experimental results show that continually pre-training a multiPLM with LEM outperforms a multiPLM continually pre-trained with MLM+TLM on all three tasks.
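
The two masking rules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lem_mask`, the span format `(start, end, entity_type)`, the type labels `NOUN`, `VERB` and `NE`, and the `mask_prob` parameter are all assumptions made for illustration, and the spans are assumed to come from an external tagger that is not shown.

```python
import random

# Hypothetical labels for the three entity types named in the abstract.
ELIGIBLE_TYPES = {"NOUN", "VERB", "NE"}


def lem_mask(tokens, spans, mask_token="[MASK]", mask_prob=1.0, rng=None):
    """Mask one token inside each eligible linguistic entity span.

    tokens: list of token strings.
    spans: list of (start, end, entity_type) with end exclusive (assumed format).
    mask_prob: chance that an eligible span is masked at all (an assumption;
        the abstract does not state a masking rate).
    Returns (masked_tokens, labels), where labels holds the original token at
    each masked position and None elsewhere.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end, entity_type in spans:
        if entity_type not in ELIGIBLE_TYPES:
            continue  # rule 1: only nouns, verbs and named entities are candidates
        if rng.random() >= mask_prob:
            continue  # span skipped under the assumed masking rate
        idx = rng.randrange(start, end)  # rule 2: a single token within the span
        labels[idx] = tokens[idx]
        masked[idx] = mask_token
    return masked, labels


tokens = ["Barack", "Obama", "visited", "the", "old", "city"]
spans = [(0, 2, "NE"), (2, 3, "VERB"), (4, 5, "ADJ"), (5, 6, "NOUN")]
masked, labels = lem_mask(tokens, spans, rng=random.Random(0))
print(masked)
```

In this example, exactly one of the two tokens in the named-entity span "Barack Obama" is masked, so the other remains visible as context, while "the" and the adjective "old" are never masked.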
