ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding

Abstract

Most multilingual vision-and-language (V&L) research aims to achieve multilingual and multimodal capabilities within one model. However, the scarcity of multilingual captions for images has hindered this development. To overcome this obstacle, we propose ICU, Image Caption Understanding, which divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM), in turn, takes the caption as the alt text and performs cross-lingual language understanding. The burden of multilingual processing is thus lifted off the V&L model and placed on the mLM. Since multilingual text data is relatively more abundant and of higher quality, ICU can help V&L models overcome language barriers. In experiments on two tasks across 9 languages in the IGLUE benchmark, we show that ICU achieves new state-of-the-art results for five languages and comparable results for the rest.
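
The two-stage division described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function name `icu_predict`, the two callable interfaces, the label strings, and the toy stand-ins, which replace a real captioning model and a real mLM with trivial lookups so the control flow can be run end to end.

```python
from typing import Any, Callable

# Hypothetical interfaces (assumptions, not the paper's API):
#   a captioner maps an image to an English caption;
#   an mLM maps (English caption, target-language text) to a label.
Captioner = Callable[[Any], str]
MultilingualLM = Callable[[str, str], str]


def icu_predict(image: Any, text: str,
                captioner: Captioner, mlm: MultilingualLM) -> str:
    """Two-stage prediction: caption in English, then understand cross-lingually."""
    caption = captioner(image)   # Stage 1: the V&L model only ever produces English
    return mlm(caption, text)    # Stage 2: the mLM handles the non-English text


# --- Toy stand-ins so the sketch runs; real models would replace these. ---

def toy_captioner(image: dict) -> str:
    # Pretend the "image" already carries its English caption.
    return image["caption"]


# Tiny Spanish-to-English lexicon standing in for cross-lingual ability.
TOY_LEXICON = {"perro": "dog", "gato": "cat"}


def toy_mlm(caption: str, text: str) -> str:
    caption_words = set(caption.lower().split())
    mapped = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    supported = bool(mapped) and all(w in caption_words for w in mapped)
    return "entailment" if supported else "contradiction"


image = {"caption": "a dog runs on the beach"}
label_dog = icu_predict(image, "un perro corre", toy_captioner, toy_mlm)
label_cat = icu_predict(image, "un gato duerme", toy_captioner, toy_mlm)
```

In this sketch the image is consumed only in Stage 1, so the second stage is a text-only problem and can draw on whatever multilingual text data the mLM was trained on.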
