Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

Abstract

Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training, such as overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings. In this work, we present a novel regularization scheme for VQA that reduces this effect. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary -- discouraging the VQA model from capturing language biases in its question encoding. Further, we leverage this question-only model to estimate the increase in model confidence after considering the image, which we maximize explicitly to encourage visual grounding. Our approach is a model-agnostic training procedure and simple to implement. We show empirically that it can improve performance significantly on a bias-sensitive split of the VQA dataset for multiple base models -- achieving state-of-the-art on this task. Further, on standard VQA tasks, our approach shows significantly less drop in accuracy compared to existing bias-reducing VQA models.
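
The two regularizers described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: both models are taken to output a probability distribution over candidate answers, "confidence" is measured as negative Shannon entropy, and the function names and the weights `lambda_q` and `lambda_h` are hypothetical. In a real system the adversarial term would act only on the shared question encoding, and the adversary would be trained separately to minimize its own loss.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the ground-truth answer index."""
    return -math.log(probs[target])


def entropy(probs):
    """Shannon entropy (in nats) of an answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def vqa_objective(p_vqa, p_q_only, target, lambda_q=0.1, lambda_h=0.1):
    """Scalar that the VQA model would minimize in this sketch.

    p_vqa    -- answer distribution from the full (question + image) model
    p_q_only -- answer distribution from the question-only adversary
    lambda_q, lambda_h -- hypothetical regularization weights
    """
    vqa_loss = cross_entropy(p_vqa, target)
    # Adversarial term: the VQA model is rewarded when the question-only
    # adversary does badly, i.e. when its loss is high.
    adversary_loss = cross_entropy(p_q_only, target)
    # Assumed confidence measure: the drop in entropy after seeing the image.
    confidence_gain = entropy(p_q_only) - entropy(p_vqa)
    return vqa_loss - lambda_q * adversary_loss - lambda_h * confidence_gain


def adversary_objective(p_q_only, target):
    """Scalar that the question-only adversary would minimize."""
    return cross_entropy(p_q_only, target)


if __name__ == "__main__":
    # Toy distributions over three candidate answers; index 0 is correct.
    p_vqa = [0.7, 0.2, 0.1]
    p_q_only = [0.4, 0.4, 0.2]
    print(vqa_objective(p_vqa, p_q_only, target=0))
    print(adversary_objective(p_q_only, target=0))
```

With both weights set to zero, `vqa_objective` reduces to the ordinary cross-entropy loss of the base VQA model, which reflects the abstract's description of the approach as a training procedure layered on top of an existing model.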
