Abstract
Existing cross-modal retrieval methods typically rely on large-scale vision-language pair data. This makes it challenging to efficiently develop a cross-modal retrieval model for under-resourced languages of interest. Therefore, Cross-lingual Cross-modal Retrieval (CCR), which aims to align vision and the low-resource language (the target language) without using any human-labeled target-language data, has gained increasing attention. A common parameter-efficient solution is to utilize adapter modules to transfer the vision-language alignment ability of Vision-Language Pretraining (VLP) models from a source language to a target language. However, these adapters are usually static once learned, making it difficult to adapt to target-language captions with varied expressions. To alleviate this, we propose Dynamic Adapter with Semantics Disentangling (DASD), whose parameters are dynamically generated conditioned on the characteristics of the input captions. Considering that the semantics and expression style of an input caption largely influence how it should be encoded, we propose a semantic disentangling module to extract semantic-related and semantic-agnostic features from the input, ensuring that the generated adapters are well-suited to the characteristics of the input caption. Extensive experiments on two image-text datasets and one video-text dataset demonstrate the effectiveness of our model for cross-lingual cross-modal retrieval, as well as its good compatibility with various VLP models.
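To make the contrast between a static and a dynamically generated adapter concrete, the following is a minimal sketch and not the DASD architecture itself. It assumes a standard bottleneck adapter with a residual connection, whose down- and up-projection weights are produced by a linear generator from a condition vector. The condition vector is assumed here to be a simple concatenation of semantic-related and semantic-agnostic caption features. All names (`DynamicAdapter`, `gen_down`, `gen_up`, `cond`) and dimensions are hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def make_matrix(rows, cols, rng, scale=0.1):
    """Create a rows x cols matrix with small random entries."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DynamicAdapter:
    """Bottleneck adapter whose weights are generated per input caption.

    Assumption: a linear generator maps the condition vector to the
    flattened adapter weights. The paper's generator may differ.
    """

    def __init__(self, hidden, bottleneck, cond_dim, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.bottleneck = bottleneck
        # Learned (here: randomly initialised) generators for adapter weights.
        self.gen_down = make_matrix(bottleneck * hidden, cond_dim, rng)
        self.gen_up = make_matrix(hidden * bottleneck, cond_dim, rng)

    def generate(self, cond):
        """Produce down/up projection matrices from the condition vector."""
        flat_down = matvec(self.gen_down, cond)
        flat_up = matvec(self.gen_up, cond)
        h, b = self.hidden, self.bottleneck
        w_down = [flat_down[i * h:(i + 1) * h] for i in range(b)]  # b x h
        w_up = [flat_up[i * b:(i + 1) * b] for i in range(h)]      # h x b
        return w_down, w_up

    def forward(self, hidden_state, cond):
        """Apply the generated adapter with a residual connection."""
        w_down, w_up = self.generate(cond)
        z = [max(0.0, x) for x in matvec(w_down, hidden_state)]  # ReLU
        delta = matvec(w_up, z)
        return [x + d for x, d in zip(hidden_state, delta)]


adapter = DynamicAdapter(hidden=4, bottleneck=2, cond_dim=3)
h = [1.0, -0.5, 0.3, 0.8]          # hypothetical token representation
semantic = [0.2, 0.1]              # hypothetical semantic-related features
cond_a = semantic + [0.7]          # same semantics, one expression style
cond_b = semantic + [-0.4]         # same semantics, another expression style
out_a = adapter.forward(h, cond_a)
out_b = adapter.forward(h, cond_b)
```

In this sketch, two captions with the same semantic features but different style features yield different adapter weights, whereas a static adapter would apply one fixed pair of projection matrices to every caption.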