Grounding Referring Expressions in Images by Variational Context

Abstract

We focus on grounding (i.e., localizing or linking) referring expressions in images, e.g., "largest elephant standing behind baby elephant". This is a general yet challenging vision-language task since it does not only require the localization of objects, but also the multimodal comprehension of context: visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category. Due to the exponential complexity involved in modeling the context associated with multiple image regions, existing work oversimplifies this task to pairwise region modeling by multiple instance learning. In this paper, we propose a variational Bayesian method, called Variational Context, to solve the problem of complex context modeling in referring expression grounding. Our model exploits the reciprocal relation between the referent and context, i.e., either of them influences the estimation of the posterior distribution of the other, and thereby the search space of context can be greatly reduced, resulting in better localization of referent. We develop a novel cue-specific language-vision embedding network that learns this reciprocity model end-to-end. We also extend the model to the unsupervised setting where no annotation for the referent is available. Extensive experiments on various benchmarks show consistent improvement over state-of-the-art methods in both supervised and unsupervised settings.
