How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?

Abstract

Current language models have been criticised for learning language from text alone without connection between words and their meaning. Consequently, multimodal training has been proposed as a way for creating models with better language understanding by providing the lacking connection. We focus on pre-trained multimodal vision-and-language (VL) models for which there already are some results on their language understanding capabilities. An unresolved issue with evaluating the linguistic skills of these models, however, is that there is no established method for adapting them to text-only input without out-of-distribution uncertainty. To find the best approach, we investigate and compare seven possible methods for adapting three different pre-trained VL models to text-only input. Our evaluations on both GLUE and Visual Property Norms (VPN) show that care should be put into adapting VL models to zero-shot text-only tasks, while the models are less sensitive to how we adapt them to non-zero-shot tasks. We also find that the adaptation methods perform differently for different models and that unimodal model counterparts perform on par with the VL models regardless of adaptation, indicating that current VL models do not necessarily gain better language understanding from their multimodal training.
