cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

Abstract

Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that uses English-only vision-language models to train a monolingual model for a target language. We extend OSCAR+, a model that leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages. We also introduce a novel approach to knowledge distillation that trains the model in other languages using parallel sentences. Compared to other models that use the target language in the pretraining corpora, we can leverage an existing English model to transfer knowledge to the target language using significantly fewer resources. We also release a large-scale visual question answering dataset in Japanese and Hindi. Though we restrict our work to visual question answering, our model can be extended to any sequence-level classification task and to other languages as well. This paper focuses on two languages for the visual question answering task: Japanese and Hindi. Our pipeline outperforms the current state-of-the-art models by a relative increase in accuracy of 4.4% and 13.4%, respectively.
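
The abstract does not spell out the distillation objective, so the following is only a minimal sketch of one common way to distill over parallel sentences, not the paper's actual method. It assumes a frozen English teacher and a target-language student that each produce a fixed-size sentence vector, and it penalizes the mean squared error between the two vectors for each parallel pair. The function name `mse_distillation_loss` and the toy vectors are hypothetical.

```python
from typing import Sequence


def mse_distillation_loss(
    teacher_vecs: Sequence[Sequence[float]],
    student_vecs: Sequence[Sequence[float]],
) -> float:
    """Mean squared error between teacher and student sentence vectors.

    teacher_vecs[i] is assumed to be the English teacher's representation of
    sentence i; student_vecs[i] is the student's representation of the
    parallel target-language sentence. (Illustrative assumption only.)
    """
    if len(teacher_vecs) != len(student_vecs):
        raise ValueError("teacher and student batches must have the same size")

    total = 0.0
    count = 0
    for t_vec, s_vec in zip(teacher_vecs, student_vecs):
        if len(t_vec) != len(s_vec):
            raise ValueError("paired vectors must have the same dimension")
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1

    if count == 0:
        raise ValueError("cannot compute a loss over an empty batch")
    return total / count


# Toy batch of two parallel sentence pairs with 2-dimensional vectors.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.0], [0.0, 0.0]]
loss = mse_distillation_loss(teacher, student)
```

In a real training loop this quantity would be minimized with respect to the student's parameters only, so that the student's target-language representations move toward the teacher's English ones.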
