Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

Abstract

While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery of only a few languages such as English restricts their applicability in broader communities. To this end, there is increasing interest in developing multilingual VL models via a joint-learning setup, which, however, can be unrealistic due to its expensive costs and limited data availability. In this work, we propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It trains only the token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF caused by covariate shift and lexical overlap, we further propose a novel approach that ensures the identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on the MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and consistently improve various state-of-the-art methods. Our code and data are available at https://github.com/yangbang18/CLFM.
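For readers unfamiliar with the reported metric, the following is a minimal sketch of how text-to-image Recall@K is commonly computed from a text-by-image similarity matrix. It is not taken from the authors' code: the function name `recall_at_k`, its arguments, and the toy scores are assumptions made for illustration, and the sketch assumes each text query has exactly one matching image.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    gold_image: Sequence[int],
    k: int = 1,
) -> float:
    """Text-to-image Recall@K (hypothetical helper, not the authors' API).

    similarity[i][j] is the score between text query i and image j.
    gold_image[i] is the index of the single image that matches text i.
    Returns the fraction of queries whose gold image is in the top k.
    """
    if not similarity:
        raise ValueError("at least one text query is required")
    if len(similarity) != len(gold_image):
        raise ValueError("one gold image index is required per text query")
    hits = 0
    for scores, gold in zip(similarity, gold_image):
        # Rank image indices by descending score; ties keep index order.
        ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
        if gold in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example with made-up scores: 3 text queries against 3 images.
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0 ranks image 0 first (matches gold)
    [0.2, 0.3, 0.5],  # query 1 ranks image 2 first (gold is image 1)
    [0.1, 0.2, 0.7],  # query 2 ranks image 2 first (matches gold)
]
toy_gold = [0, 1, 2]
score = recall_at_k(toy_similarity, toy_gold, k=1)
```

In a multilingual setting such as the one described above, a score like this would typically be computed per language and then averaged across languages; how the benchmark aggregates it exactly is specified in the full paper rather than in this abstract.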
