VinVL: Making Visual Representations Matter in Vision-Language Models

Abstract

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used \emph{bottom-up and top-down} model \cite{anderson2018bottom}, the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model \oscar\ \cite{li2020oscar}, and utilize an improved approach \short\ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to the public.
