Abstract
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC), which includes object attributes, relations, and states that are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. One possible way of fixing this issue is to collect dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities, both when training from scratch and when fine-tuning a pre-trained model.