MURAL: Multimodal, Multitask Retrieval Across Languages

Abstract

Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., PMLR'21), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.
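
To make the two-task setup concrete, the following is a minimal sketch of how a dual encoder might combine an image-text matching loss with a translation pair matching loss. The abstract does not specify the loss function, so the symmetric in-batch softmax contrastive loss, the temperature value, the task weights `w_i2t` and `w_t2t`, and all function names here are illustrative assumptions rather than the authors' implementation. The embeddings stand in for encoder outputs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit L2 norm so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    # Mean softmax cross-entropy where the correct column for row i is i,
    # i.e. the i-th left item is paired with the i-th right item.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(left, right, temperature=0.1):
    # Assumed loss: symmetric in-batch contrastive loss (left-to-right plus
    # right-to-left), with other batch items serving as negatives.
    left = [normalize(v) for v in left]
    right = [normalize(v) for v in right]
    logits = [[dot(l, r) / temperature for r in right] for l in left]
    transposed = [list(col) for col in zip(*logits)]
    return diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed)


def multitask_loss(image_emb, caption_emb, src_emb, tgt_emb,
                   w_i2t=1.0, w_t2t=1.0):
    # Hypothetical weighting of the two tasks named in the abstract:
    # 1) image-text matching and 2) translation pair matching.
    image_text = contrastive_loss(image_emb, caption_emb)
    text_text = contrastive_loss(src_emb, tgt_emb)
    return w_i2t * image_text + w_t2t * text_text


# Toy embeddings: correctly paired items share a direction.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]
source_sentences = [[1.0, 0.2], [0.2, 1.0]]
target_sentences = [[1.0, 0.1], [0.1, 1.0]]

aligned = multitask_loss(images, captions, source_sentences, target_sentences)
shuffled = multitask_loss(images, captions[::-1],
                          source_sentences, target_sentences[::-1])
```

In this toy setting, `aligned` comes out lower than `shuffled`, since correctly paired embeddings score highest along the diagonal of each similarity matrix. In this sketch the text encoder is shared by both tasks, so the translation pairs shape the same text representations used for cross-modal retrieval, which is one way to read the abstract's claim that text-text learning can compensate for scarce image-caption data.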
