Abstract
Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., PMLR'21), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.
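To make the two-task setup concrete, the following is a minimal sketch of how a dual encoder might combine an image-text matching loss with a translation pair matching loss. It is not the MURAL implementation. The symmetric in-batch softmax form, the temperature, the task weights `w_i2t` and `w_t2t`, and all function names are assumptions made for illustration, and the embeddings are assumed to come from image and text encoders that are not shown.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def in_batch_softmax_loss(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i],
    using every other key in the batch as a negative."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(q)

def symmetric_loss(a, b, temperature=0.1):
    """Sum of the a-to-b and b-to-a matching losses."""
    return (in_batch_softmax_loss(a, b, temperature)
            + in_batch_softmax_loss(b, a, temperature))

def multitask_loss(image_emb, caption_emb, src_emb, tgt_emb,
                   w_i2t=1.0, w_t2t=1.0, temperature=0.1):
    """Weighted sum of the two task losses (weights are hypothetical)."""
    i2t = symmetric_loss(image_emb, caption_emb, temperature)  # image-text matching
    t2t = symmetric_loss(src_emb, tgt_emb, temperature)        # translation pair matching
    return w_i2t * i2t + w_t2t * t2t
```

In this sketch the caption, source, and target embeddings would all come from one shared text encoder, so translation pairs can shape the text representations for languages that have few image-caption examples.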