A Simple and Effective Method To Eliminate the Self Language Bias in Multilingual Representations

Abstract

Language agnostic and semantic-language information isolation is an emerging research direction for multilingual representation models. We explore this problem from a novel angle of geometric algebra and semantic space. A simple but highly effective method, "Language Information Removal (LIR)", factors out language identity information from semantically related components in multilingual representations pre-trained on multi-monolingual data. A post-training and model-agnostic method, LIR uses only simple linear operations, e.g. matrix factorization and orthogonal projection. LIR reveals that for weak-alignment multilingual systems, the principal components of semantic spaces primarily encode language identity information. We first evaluate LIR on a cross-lingual question answer retrieval task (LAReQA), which requires strong alignment of the multilingual embedding space. Experiments show that LIR is highly effective on this task, yielding almost 100% relative improvement in MAP for weak-alignment models. We then evaluate LIR on the Amazon Reviews and XEVAL datasets, with the observation that removing language information improves cross-lingual transfer performance.
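
The two linear operations named in the abstract, matrix factorization and orthogonal projection, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of an uncentered SVD as the factorization, and the number of removed components `k` are all assumptions made for illustration. The idea shown is that the top principal directions of one language's sentence embeddings are estimated and then projected out of each embedding.

```python
import numpy as np


def language_components(lang_embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Estimate k dominant directions of one language's embeddings.

    lang_embeddings has shape (n_sentences, dim). An uncentered SVD is used
    here as the matrix factorization; this choice is an assumption.
    Returns a (dim, k) matrix with orthonormal columns.
    """
    _, _, vt = np.linalg.svd(lang_embeddings, full_matrices=False)
    return vt[:k].T


def remove_language_information(embeddings: np.ndarray,
                                components: np.ndarray) -> np.ndarray:
    """Project embeddings onto the orthogonal complement of the components."""
    return embeddings - embeddings @ components @ components.T


# Toy data (hypothetical): every "sentence" shares a strong offset along
# axis 0, standing in for a language-identity direction.
rng = np.random.default_rng(0)
dim = 8
offset = np.zeros(dim)
offset[0] = 5.0
toy_embeddings = 0.1 * rng.normal(size=(50, dim)) + offset

components = language_components(toy_embeddings, k=1)
cleaned = remove_language_information(toy_embeddings, components)
```

After the projection, `cleaned` has no remaining component along the estimated directions, while the other dimensions are left essentially unchanged. In a multilingual setting one would presumably estimate the components separately for each language and apply the removal to that language's embeddings; how many components to remove is a tunable choice not specified in the abstract.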
