Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task where we must both identify whether the phrase is relevant to an image and localize the phrase. This can also be viewed as a generalization of object detection to an open-ended vocabulary, introducing elements of few- and zero-shot detection. We propose an approach for this task that extends Faster R-CNN to relate image regions and phrases. By carefully initializing the classification layers of our network using canonical correlation analysis (CCA), we encourage a solution that is more discerning when reasoning between similar phrases, resulting in over double the performance compared to a naive adaptation on two popular phrase grounding datasets, Flickr30K Entities and ReferIt Game, with test-time phrase vocabulary sizes of 5K and 32K, respectively.