Abstract
We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT), and improves performance on vision-and-language (VL) tasks that involve more complex text inputs than image captions, while having minimal impact on training and inference efficiency. Importantly, ViLT enables efficient training and inference in VL tasks by encoding images with a linear projection of patches instead of an object detector. However, it is pretrained on captioning datasets, where the language input is simple, literal, and descriptive, and therefore lacks linguistic diversity. As a result, multimedia data in the wild, such as multimodal social media data, exhibits a notable shift from captioning language data and spans a greater diversity of tasks. We indeed find evidence that the language capacity of ViLT is lacking. The key insight and novelty of VAuLT is to propagate the output representations of a large language model (LM) like BERT to the language input of ViLT. We show that joint training of the LM and ViLT can yield relative improvements of up to 20% over ViLT and achieve state-of-the-art or comparable performance on VL tasks involving richer language inputs and affective constructs, such as Target-Oriented Sentiment Classification in TWITTER-2015 and TWITTER-2017, and Sentiment Classification in MVSA-Single and MVSA-Multiple. Our code is available at https://github.com/gchochla/VAuLT.
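To make the central idea concrete, the following is a minimal PyTorch sketch of feeding a language model's output representations into the text input of a patch-based VL encoder and training both jointly. It is an illustration under stated assumptions, not the released implementation: `TinyTextEncoder`, `TinyPatchVL`, and `AugmentedVL` are hypothetical, randomly initialised stand-ins for BERT, ViLT, and their combination, and the layer sizes, the three-class output, and the omission of image position embeddings are choices made only to keep the example short.

```python
import torch
from torch import nn
import torch.nn.functional as F


def make_encoder(dim):
    # One small transformer layer; real models use many more (assumption).
    layer = nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=64, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=1)


class TinyTextEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained LM such as BERT."""

    def __init__(self, vocab_size=100, dim=32, max_len=16):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        self.encoder = make_encoder(dim)

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return self.encoder(self.tok(input_ids) + self.pos(positions))  # (batch, seq, dim)


class TinyPatchVL(nn.Module):
    """Hypothetical stand-in for a ViLT-style encoder: linear patch projection, no object detector."""

    def __init__(self, dim=32, patch=8, channels=3, num_classes=3):
        super().__init__()
        self.patch = patch
        self.patch_proj = nn.Linear(channels * patch * patch, dim)
        self.modality = nn.Embedding(2, dim)  # index 0 = text, 1 = image
        self.encoder = make_encoder(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, text_embeds, pixel_values):
        b, c, _, _ = pixel_values.shape
        p = self.patch
        # Cut the image into non-overlapping p x p patches and flatten each one.
        patches = pixel_values.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        img = self.patch_proj(patches) + self.modality.weight[1]
        txt = text_embeds + self.modality.weight[0]
        joint = self.encoder(torch.cat([txt, img], dim=1))
        return self.head(joint[:, 0])  # classify from the first text position


class AugmentedVL(nn.Module):
    """LM outputs replace plain word embeddings as the VL encoder's text input."""

    def __init__(self):
        super().__init__()
        self.lm = TinyTextEncoder()
        self.vl = TinyPatchVL()

    def forward(self, input_ids, pixel_values):
        return self.vl(self.lm(input_ids), pixel_values)


torch.manual_seed(0)
model = AugmentedVL()
input_ids = torch.randint(0, 100, (2, 10))   # two toy token sequences
pixel_values = torch.rand(2, 3, 32, 32)      # two toy RGB images
labels = torch.tensor([0, 2])                # toy sentiment labels

logits = model(input_ids, pixel_values)
loss = F.cross_entropy(logits, labels)
loss.backward()  # joint training: gradients reach both the LM and the VL encoder
```

Because the loss is backpropagated through `self.vl` into `self.lm`, a single optimizer over `model.parameters()` would update both components, which is the sense of "joint training" used in the abstract. With pretrained models, the stand-ins would be replaced by the actual BERT and ViLT checkpoints, and the hidden sizes of the two would need to match (or be bridged by a projection).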