Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem

Abstract

Several studies have recently pointed out that existing Visual Question Answering (VQA) models suffer heavily from the language prior problem, which refers to capturing superficial statistical correlations between the question type and the answer while ignoring the image contents. Numerous efforts have been dedicated to strengthening the image dependency by creating delicate models or introducing extra visual annotations. However, these methods cannot sufficiently explore how the visual cues explicitly affect the learned answer representation, which is vital for alleviating language reliance. Moreover, they generally emphasize the class-level discrimination of the learned answer representation, which overlooks the more fine-grained instance-level patterns and demands further optimization. In this paper, we propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration, which can better investigate the fine-grained visual effects and mitigate the language prior problem by learning the instance-level characteristics. Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents, based on which the collaborative learning of intra-instance invariance and inter-instance discrimination is implemented by two well-designed discriminators. Besides, we implement an information bottleneck modulator on the latent space for further bias alleviation and representation calibration. We apply our visual perturbation-aware framework to three orthodox baselines, and the experimental results on two diagnostic VQA-CP benchmark datasets clearly demonstrate its effectiveness. In addition, we also verify its robustness on the balanced VQA benchmark.
