MULE: Multimodal Universal Language Embedding

Abstract

Existing vision-language methods typically support at most two languages at a time. In this paper, we present a modular approach that can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate the MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to easily be adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations.
