Revisiting Image-Language Networks for Open-ended Phrase Detection

Abstract

Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task where we must both identify whether the phrase is relevant to an image and localize the phrase. This can also be viewed as a generalization of object detection to an open-ended vocabulary, introducing elements of few- and zero-shot detection. We propose an approach for this task that extends Faster R-CNN to relate image regions and phrases. By carefully initializing the classification layers of our network using canonical correlation analysis (CCA), we encourage a solution that is more discerning when reasoning between similar phrases, resulting in over double the performance compared to a naive adaptation on two popular phrase grounding datasets, Flickr30K Entities and ReferIt Game, with test-time phrase vocabulary sizes of 5K and 32K, respectively.
