Effect of Vision-and-Language Extensions on Natural Language Understanding in Vision-and-Language Models

Abstract

Extending language models with structural modifications and vision-and-language (V&L) pretraining are successful ways of making V&L models that can ground vision and language. Potential applications of these advanced models include multi-modal machine reading comprehension models and multi-modal dialogue models, which require language ability upon grounding. Although language capability is crucial for such applications, the impact of extending their visual capabilities on their language capabilities is not fully understood. This paper investigates how visual extension affects the language capability of V&L models using the GLUE benchmark. We found that visual extension causes some decreases in language capability and that V&L pretraining has a greater impact than structural modifications on the decreases. Our results suggest the need for further study on pretraining that can maintain or, if possible, improve a model's language capability.
