Revisiting Image-Language Networks for Open-ended Phrase Detection

  • 2019-11-02 01:05:23
  • Bryan A. Plummer, Kevin J. Shih, Yichen Li, Ke Xu, Svetlana Lazebnik, Stan Sclaroff, Kate Saenko


Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task where we must both identify whether the phrase is relevant to an image and localize the phrase. This can also be viewed as a generalization of object detection to an open-ended vocabulary, introducing elements of few- and zero-shot detection. We propose an approach for this task that extends Faster R-CNN to relate image regions and phrases. By carefully initializing the classification layers of our network using canonical correlation analysis (CCA), we encourage a solution that is more discerning when reasoning between similar phrases, resulting in over double the performance compared to a naive adaptation on two popular phrase grounding datasets, Flickr30K Entities and ReferIt Game, with test-time phrase vocabulary sizes of 5K and 32K, respectively.



Bryan A. Plummer, Yichen Li, Stan Sclaroff, and Kate Saenko are with the Department of Computer Science, Boston University, Boston, MA, 02215. E-mail: {bplum,liych,sclaroff,saenko}

Kevin J. Shih is with the NVIDIA Corporation, Santa Clara, CA 95051. Email: [email protected]

Ke Xu and Svetlana Lazebnik are with the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801. Email: {kexu6,slazebni}


Vision and language, phrase grounding, object detection, representation learning

1 Introduction

Traditionally, object detection and localization benchmarks have focused on relatively few categories with many samples per category. As methods tailored for such benchmarks reach maturity, the recognition community is starting to push towards larger-vocabulary, long-tailed scenarios – see, e.g., [Gupta19] for a recent example. If we allow object categories to be described by freeform natural language phrases, we arrive at the task of phrase detection.

Recently there has been a lot of research on various phrase grounding or localization scenarios. In the most common scenario, the text query to be localized is assumed to be present in the test image [ChenICMR2017, ChenICCV2017, fukui16emnlp, hu2015natural, kazemzadeh-EtAl:2014:EMNLP2014, mao2016generation, plummerCITE2017, plummerPLCLC2017, flickrentitiesijcv, rohrbach2015, wangTwoBranch2017, wang2016matching, yehNIPS2017]. The few works that relax this assumption make other simplifications, e.g., limiting the number of negative or distractor images for a query [Zhang_2017_CVPR], or limiting queries to the most common phrases in a dataset [hinamiARXIV2017]. Despite significant gains reported in these and other works, phrase grounding has, so far, failed to yield decisive improvements in downstream applications for which it would seem to be a natural building block, such as text-to-image search, image captioning, or visual question answering. We believe this is due at least in part to the restricted definitions of grounding adopted in existing work.

In this paper we consider a true phrase detection benchmark without any simplifications or restrictions. Given a query phrase, the goal is to identify every image region associated with that phrase within a database of test images. Figure 1 contrasts this definition with the far more popular phrase localization scenario in which the goal is only to find queries that are present somewhere in the image. Phrase localization is typically evaluated using accuracy, or the percentage of queries correctly localized. To obtain good performance, it is sufficient to simply identify the relevant region for a given phrase. Crucially, there is no need to ensure consistent calibration of scores across different test images. By contrast, phrase detection is evaluated using average precision, which measures the ability of the model to separate positive regions from negative ones across the entire set of test images – a much more demanding criterion. In other words, a good phrase detector must return region-phrase scores that can be consistently interpreted as probabilities or confidences that the given phrase describes the given region. As our experiments will demonstrate, this is a very hard challenge indeed for most existing methods.
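To make the difference between the two metrics concrete, the following minimal sketch computes localization accuracy and a non-interpolated average precision on made-up region scores for a single phrase. The function names, the toy scores, and the particular AP variant are illustrative assumptions, not the evaluation code used in the paper.

```python
def localization_accuracy(images):
    """Fraction of images whose top-scoring region is a true match.

    `images` holds one list of (score, is_positive) candidates per image
    that is known to contain the phrase (the localization assumption).
    """
    hits = sum(1 for regions in images
               if max(regions, key=lambda r: r[0])[1])
    return hits / len(images)


def average_precision(regions):
    """Non-interpolated AP over a single ranked list pooled across images."""
    ranked = sorted(regions, key=lambda r: r[0], reverse=True)
    num_pos = sum(1 for _, is_pos in ranked if is_pos)
    hits, precision_sum = 0, 0.0
    for rank, (_, is_pos) in enumerate(ranked, start=1):
        if is_pos:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_pos if num_pos else 0.0


# Toy scores for one phrase (all values are made up for illustration).
image_a = [(0.30, True), (0.10, False)]   # phrase present
image_b = [(0.35, True), (0.20, False)]   # phrase present
image_c = [(0.90, False), (0.80, False)]  # phrase absent, but scored highly

accuracy = localization_accuracy([image_a, image_b])
ap = average_precision(image_a + image_b + image_c)
```

In this toy case the correct region is ranked first within each image that contains the phrase, so localization accuracy is perfect, while the poorly calibrated scores on the third image place two negatives above every positive in the pooled ranking and pull AP down.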

A good phrase detector must also be able to localize phrases well, but a good phrase localizer may be a poor detector. This is because in standard phrase localization benchmarks, distinguishing between closely related phrases is usually unnecessary. To give a concrete example, in the Flickr30K Entities benchmark [flickrentitiesijcv], 60% of test images have a single annotated person reference, and 27% have only two. If we assume every person is annotated in these images and assign predicted person boxes to person phrases at random, we would expect to get 79% of the person references correct. This is even more pronounced for other phrase types – e.g., an oracle vehicle detector would get 95% of vehicle phrases correct since most images only contain a single reference to a vehicle. Thus, phrase localization often degenerates to simply identifying the basic object category the phrase refers to. As a consequence, models trained for phrase localization tend to overfit to these conditions, and have little ability to identify which phrases are actually relevant to an image. As a result, simple phrase grounding methods such as Canonical Correlation Analysis (CCA) [hotelling1936relations] tend to perform much better on phrase detection than the state-of-the-art phrase localization methods.

Fig. 1: In this paper we address the task of phrase detection, where the goal is to identify any image regions related to a query phrase and remove all other candidates. This is a more challenging version of the phrase localization task addressed in prior work that assumes ground truth image-phrase pairs are given at test time.

To demonstrate this, we benchmark several methods that perform well on the phrase localization task. These include the classical CCA baseline, as well as several state-of-the-art methods including two-branch embedding networks [wangTwoBranch2017], conditional image-text embedding networks (CITE) [plummerCITE2017], and Query-Adaptive R-CNN [hinamiARXIV2017]. Although each of these methods outperforms CCA on phrase localization, CCA sometimes doubles their performance on the phrase detection task.

As the analysis of Section 6 will demonstrate, CCA’s surprisingly good performance is due to its ability to better discriminate across similar phrases than its neural network counterparts. This makes sense as CCA can be seen as whitening and then aligning the image regions and text features by looking at the entire dataset. In contrast, the minibatches used to train neural network-based approaches can only see a tiny portion of the data at a time due to limits of GPU memory. This makes distinguishing between similar phrases difficult, since only a few of them can be included in each minibatch during training.

The results of our experiments inspire us to reconsider CCA’s role in vision-language tasks. Rather than being used only as a baseline, it can be seen as a data whitening or normalization procedure that can be used to initialize the layers of a neural network responsible for mapping together visual and textual representations, instead of using the standard random initialization. For a phrase detection model, this means we could fine-tune the CCA weights, which are trained only using positive region-phrase pairs, to make them more discriminative by showing them both positive and negative pairs. This results in a model that gets the best performance on both phrase detection and localization.

In addition to taking a fresh look at the role of CCA for challenging image-language tasks, we evaluate several additional ideas to further boost performance. First, we use WordNet [Miller:1995:WLD:219717.219748] to perform positive phrase augmentation (PPA), which identifies valid alternatives to annotated phrases. With this approach, we would consider a person to be a positive for a region annotated with a construction worker even if they were not annotated as such. This helps mitigate the annotation sparsity issues in phrase grounding datasets. Second, to help reduce overfitting to the phrase localization task, we use inverse frequency sampling (IFS), which biases our minibatches to select less common phrases during training. Many of these rarer phrases refer to fine-grained concepts, and by including them more often we encourage our model to learn how to separate them.

Our contributions may be summarized as follows. We argue that unrestricted phrase detection is a more challenging and meaningful task for visual grounding than the currently more popular phrase localization benchmark, in which the query phrase is already assumed to be present in the image. In the proposed detection framework, we perform a comparative evaluation of several approaches that have demonstrated state-of-the-art results on phrase localization. The evaluation is done on two popular grounding datasets, Flickr30K Entities [flickrentitiesijcv] and ReferIt Game [kazemzadeh-EtAl:2014:EMNLP2014], with test set phrase vocabulary sizes of 5K and 32K, respectively. As the performance measure, we report mean average precision (mAP) across these entire vocabularies, further broken down by phrase frequencies. Perhaps surprisingly, we find that the relative standing of different approaches on the localization and detection tasks is quite different. In particular, state-of-the-art phrase localization models tend to overfit to phrase localization and perform relatively poorly on detection, while seemingly “simpler” CCA baselines actually have a better ability to discriminate between similar phrases and produce scores that are more predictive of the presence of a phrase in the image. Ultimately, we obtain our best detection performance by fine-tuning a CCA-initialized model, suggesting that CCA may be best thought of as a basic data alignment or normalization procedure that improves performance for cross-modal tasks. We have made our code publicly available to encourage further research using the same phrase detection benchmark.

Fig. 2: Model Overview. Our phrase detection model follows the Faster R-CNN paradigm (shown on the left) consisting of a region proposal network, a bounding box regressor, and a region classifier. For the region classifier, which separates regions that are relevant to a phrase from irrelevant regions, we benchmark several variants on phrase detection based on methods used for the phrase localization task. We find that careful initialization of the region classification layers is key to enabling our model to discriminate between related phrases.

2 Related Work

Most existing visual grounding approaches are neural networks that fuse visual and textual representations. Text features are typically obtained from pre-trained language embeddings like Word2Vec [mikolov2013efficient], FastText [bojanowski:hal-01154523], or BERT [devlin2018bert], and visual features are often obtained from pre-trained convolutional neural networks (CNNs) such as VGG [simonyan2014very] or ResNet [He2015] models, which are also used as a backbone in a Faster R-CNN network [renNIPS15fasterrcnn] adapted to visual grounding (e.g., [hinamiARXIV2017, ChenICCV2017]). Typical fusion strategies involve learning a cross-modal embedding space between image region and phrase features where distances are meaningful [flickrentitiesijcv, wang2016CVPR, wangTwoBranch2017] or a classifier on top of a fused region-phrase representation [ChenICCV2017, fukui16emnlp, wangTwoBranch2017, plummerCITE2017].

As explained in the Introduction, much of the prior work in phrase localization assumes each query phrase is actually present in the image. Some papers have investigated fusion strategies for image and text features (e.g., [wangTwoBranch2017, rohrbach2015, fukui16emnlp, hu2015natural, plummerCITE2017]). Others have focused on how candidate regions are selected [yehNIPS2017, ChenICCV2017] or incorporated more sophisticated linguistic cues [Luo_2017_CVPR, plummerPLCLC2017, wang2016matching, Liu_2017_ICCV]. Many attention models for tasks like image captioning or visual question answering try to associate individual words with regions of an image (e.g., [fang2014captions, MisraNoisy16, Yao_2017_ICCV, Anderson_2018_CVPR, lee2018stacked]). However, Liu et al. [liuAAAI2017] showed that these kinds of models did not localize the individual concepts as well as supervised phrase grounding methods.

More relevant to our work are phrase grounding methods that do not assume a ground truth image-phrase pair is provided at test time. Zhang et al. [Zhang_2017_CVPR] evaluated a scenario in which each query phrase had to be localized in all its positive images (i.e., images known to contain the phrase) plus a limited number of negative (distractor) images. Hinami and Satoh [hinamiARXIV2017] addressed a simpler version of phrase detection covering just the most common phrases (<0.001% of the available text queries), effectively ignoring the challenging zero- and few-shot aspects of phrase detection. As we will show, the largest disparity in performance across methods comes from how well they handle these uncommon phrases. Thus, including these aspects has significant ramifications for how we evaluate compared methods.

A task related to phrase detection is dense captioning [densecap2015], which generates descriptions for image regions. However, Zhang et al. [Zhang_2017_CVPR] showed this model performs quite poorly on the standard phrase localization task. This likely can be attributed, in part, to the difficulty in capturing details that has been a well-documented problem for image captioning methods. It also explains the trend on the bidirectional image-sentence retrieval task where discriminative methods tend to perform better than generative models [lee2018stacked, Wehrmann_2018_CVPR]. The metrics used to evaluate dense captioning are also quite different. For this task, describing a region with the phrase a young man when it should be associated with the phrase an old man would be considered as mostly correct, whereas in our formulation it is considered completely incorrect.

Also related are recently proposed tasks of visual relationship detection (VRD) [lu2016visual] and scene graph generation [Johnson2015CVPR]. In VRD the task is to determine which relationships exist in an image and localize them, but it is often performed over a limited set of pre-defined objects and predicates and measured in Recall@{50, 100}. One need only localize a single instance of a relationship for it to be considered correct even if multiple instances are present in an image, and images with few relationships can also have many incorrect predictions without penalty. Like phrase detection, scene graph generation identifies large vocabularies of concepts in images. However, like phrase localization, scene graphs are evaluated on how well they localize known entities in an image and not on their ability to discriminate between images that contain a phrase and those that do not. In contrast, phrase detection uses an open-ended vocabulary with metrics that take into account how well a model separates all correct and incorrect predictions.

3 Phrase Detection Model

This section describes the models we use in our study of phrase detection. As our backbone, we adopt the Faster R-CNN [renNIPS15fasterrcnn] architecture, as illustrated in Figure 2. First, as in the standard Faster R-CNN, an initial image representation is computed using a convolutional neural network (CNN), and the resulting feature map is fed into a region proposal network (RPN) that generates a set of candidate bounding boxes (Section 3.1). A single set of proposals is generated per image, i.e., the RPN is agnostic to the phrases being detected. Next, we use an ROI Pooling layer [girshickICCV15fastrcnn] to obtain a feature representation for each region proposal selected by the RPN. In the standard Faster R-CNN, the ROI feature gets fed into a bounding box regressor (BBReg) and region classifier to predict a refinement to the bounding box coordinates and a set of per-class probabilities for the region. To extend this architecture to phrase detection, we introduce a text embedding branch that computes an encoding of the target phrase, and its output is fused with the ROIPool feature to compute a joint region-phrase representation that is used by both the BBReg and region classifier layers. In particular, the output of the region classifier is the relevance score between the phrases being detected and the image regions (Section 3.3). When obtaining scores for multiple batches of phrases, the CNN representation and RPN/ROI Pooling need only be computed once since they are not phrase-specific, significantly reducing the computational cost of testing a large number of phrases for each image.

As discussed in the Introduction, discriminating between closely related phrases is a key challenge of phrase detection. Accordingly, the bulk of our experimental study consists in comparing a number of region classifiers adopted from phrase localization literature, as illustrated on the right of Figure 2. We use the same RPN and BBReg components to generate region proposals for all classifiers we compare. Sections 3.1 and 3.2 will describe the design of the RPN and BBReg components in detail, while Section 3.3 will introduce all the region classifiers included in our study. Later, Section 4 will discuss additional implementation aspects we found to be important for improving phrase detection performance, including initialization of region-phrase alignment layers, and augmentation and sampling procedures for phrases to deal with data sparsity and the long-tailed distribution of phrases in the training data.

3.1 Region Proposal Network

Rather than using a hand-crafted category-independent region proposal method like many earlier phrase localization approaches (e.g., [hu2015natural, plummerCITE2017, flickrentitiesijcv, rohrbach2015]), we train an RPN followed by a phrase-aware bounding box regression to obtain region candidates. We follow the original RPN formulation [renNIPS15fasterrcnn], which we shall briefly review. The RPN predicts the proposals most likely to contain objects from a set of anchor boxes. These anchors are generated over an image feature map output by the CNN using different scales and aspect ratios. Positive anchor boxes are those with at least 0.7 intersection over union (IOU) with a ground truth box.
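As an illustration of the positive-anchor rule just stated, here is a minimal sketch of IOU and the 0.7 threshold test. The box format `(xmin, ymin, xmax, ymax)` and the function names are assumptions made for this example, and only the rule mentioned in the text is implemented; any additional labeling rules in a full RPN implementation are omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positive_anchor_indices(anchors, gt_boxes, threshold=0.7):
    """Indices of anchors whose IOU with any ground truth box meets the threshold."""
    return [i for i, anchor in enumerate(anchors)
            if any(iou(anchor, gt) >= threshold for gt in gt_boxes)]


anchors = [(0, 0, 10, 10), (1, 0, 11, 10), (5, 0, 15, 10)]
ground_truth = [(0, 0, 10, 10)]
positives = positive_anchor_indices(anchors, ground_truth)
```

The second anchor overlaps the ground truth box with IOU 90/110, so it counts as positive, while the third (IOU 1/3) does not.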

The parameters of the RPN are trained using a weighted linear combination of a log-loss over two labels indicating whether an anchor contains an object or not, along with a smooth L1 (i.e., Huber) loss [girshickICCV15fastrcnn, huberloss]. Adopting the notation from Ren et al. [renNIPS15fasterrcnn], let $t_i$ be the predicted box for anchor $i$, $t_i^*$ the ground truth box, $p_i$ the likelihood of $t_i$ being an object, and $p_i^*$ the indicator variable that is 1 if the anchor is positive and 0 otherwise. The RPN loss is then defined as:

$$L_{RPN} = \frac{1}{N_c} \sum_i^{N_c} L_{\text{log-loss}}(p_i, p_i^*) + \frac{\lambda_2}{N_r} \sum_i^{N_r} p_i^* L_{\text{smooth L1}}(t_i, t_i^*),$$

where $\lambda_2$ is a scalar parameter and $N_r, N_c$ are the number of samples in a minibatch and the number of anchor locations, respectively. In our experiments, we kept all RPN-related hyperparameters the same as their defaults for training an object detector on the MSCOCO dataset [lin2014microsoft] in a publicly available implementation of Faster R-CNN. We use a 101-layer ResNet [He2015] as our CNN image encoder and initialize it with a network that was trained for object detection on MSCOCO [lin2014microsoft]. During training we randomly subsample five ground truth phrases per image in each epoch, as we found that having balanced minibatches improves performance.
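The two-term RPN objective described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the training code: boxes are short lists of regression targets, both normalizers default to the number of anchors passed in, and the function and argument names are assumptions introduced here.

```python
import math


def smooth_l1(pred, target):
    """Smooth L1 (Huber with threshold 1), summed over box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total


def log_loss(prob, label, eps=1e-12):
    """Binary log-loss for an objectness probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def rpn_loss(p, p_star, t, t_star, lam, n_c=None, n_r=None):
    """Classification term plus lam times the box term on positive anchors.

    n_c and n_r are the two normalizers; defaulting both to len(p) is a
    simplification made for this sketch.
    """
    n_c = n_c if n_c is not None else len(p)
    n_r = n_r if n_r is not None else len(p)
    cls_term = sum(log_loss(pi, si) for pi, si in zip(p, p_star)) / n_c
    # The indicator p_star masks out regression loss for negative anchors.
    reg_term = sum(si * smooth_l1(ti, tsi)
                   for si, ti, tsi in zip(p_star, t, t_star)) / n_r
    return cls_term + lam * reg_term


loss = rpn_loss(
    p=[0.5, 0.5],
    p_star=[1, 0],
    t=[[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]],
    t_star=[[0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]],
    lam=2.0,
)
```

Note that the second anchor is negative, so its large box error contributes nothing to the regression term.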

3.2 Phrase-Aware Bounding Box Regression

As stated in the beginning of Section 3, the BBReg and region classifier components of our system take as input a joint image-text representation obtained by fusing ROIPool features representing the region proposal with an embedding of the target phrase. The BBReg component is more straightforward, so we discuss it first.

Our region feature is a concatenation of the standard Region-of-Interest (ROI) features and the 5-dimensional bounding box feature shown to improve grounding performance in prior work (e.g., [ChenICCV2017, plummerCITE2017]). Specifically, for an image of height $H$ and width $W$ and a box with height $h$ and width $w$, the bounding box feature is encoded as $[x_{min}/W,\; y_{min}/H,\; x_{max}/W,\; y_{max}/H,\; wh/WH]$.
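The bounding box feature is simple enough to write out directly. The sketch below follows the encoding given above; the function name and argument order are chosen here for illustration.

```python
def box_feature(xmin, ymin, xmax, ymax, img_w, img_h):
    """5-d spatial feature: normalized corners plus relative box area."""
    box_w = xmax - xmin
    box_h = ymax - ymin
    return [
        xmin / img_w,
        ymin / img_h,
        xmax / img_w,
        ymax / img_h,
        (box_w * box_h) / (img_w * img_h),
    ]


# A 100x100 box inside a 200-wide, 400-high image.
feature = box_feature(50, 20, 150, 120, img_w=200, img_h=400)
```

In the model this vector would be concatenated with the ROI-pooled visual feature for the same region.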

Our phrase encoding is given by HGLMM Fisher vectors [klein2014fisher], which are built on top of word2vec [mikolov2013efficient] and PCA-reduced to 6,000 dimensions. These HGLMM features are projected to the same size as the region features using a fully connected layer followed by a batch normalization layer [Ioffe:2015:BNA:3045118.3045167]. This representation is partly a legacy of our earlier work [plummerPLCLC2017, plummerCITE2017], but is also supported by our concurrent study [burnsLanguage2019], in which some of the present co-authors have discovered that HGLMM features often outperform more recent embeddings on vision-language tasks.

The input to the bounding box regressor is given by the element-wise product of the above region and phrase features. While our feature representation is different, our loss for BBReg is the same as in Ren et al. [renNIPS15fasterrcnn], i.e.,

$$L_{reg} = \frac{1}{4N_r} \sum_i^{N_r} L_{\text{smooth L1}}(t_i, t_i^*). \tag{1}$$

We also use the same architecture for BBReg as Ren et al. [renNIPS15fasterrcnn], namely, a single fully connected layer. We train the RPN and BBReg first. Later we initialize the classification layers and train them as discussed in the next section. At the end, we fine-tune the whole network.

3.3 Region classifier

The region classifier’s task is to output a confidence or compatibility score given region and phrase features. The region and phrase features are the same as those used by BBReg (Section 3.2). For the fusion and classification of these features, we compare several methods that show the most promise in current phrase localization literature. Note that each region classifier uses the same bounding box proposal method to provide a fair comparison.

3.3.1 Query Adaptive R-CNN (QA R-CNN)

QA R-CNN [hinamiARXIV2017] is an earlier adaptation of Faster R-CNN to phrase grounding. To relate image regions to phrases, QA R-CNN generates the parameters of a linear classifier given the text features as input (refer to Figure 2 for an illustration). More formally, let $w_c$ be the weights of a linear classifier and $v$ be the phrase features. The classifier is generated using $w_c = W_c v$, where $W_c$ is a learned projection matrix. During training, each phrase associated with an image is considered its own category in a sigmoid cross-entropy loss. In our implementation, we use a single fully connected layer for BBReg instead of a multi-layer perceptron as in [hinamiARXIV2017], but we found this to produce similar results.
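The idea of generating a classifier from the phrase can be illustrated with a toy example. The sketch below assumes tiny hand-picked dimensions, omits any bias term, and applies a sigmoid to the resulting logit to match the sigmoid cross-entropy training described above; all names and values are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def generate_classifier(W_c, phrase_feature):
    """Produce phrase-specific linear classifier weights w_c = W_c v."""
    return matvec(W_c, phrase_feature)


def region_score(w_c, region_feature):
    """Sigmoid of the dot product between classifier and region feature."""
    logit = sum(w * r for w, r in zip(w_c, region_feature))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical sizes: 2-d phrase features, 3-d region features.
W_c = [[1.0, 0.0],
       [0.0, 1.0],
       [1.0, 1.0]]
phrase = [1.0, -1.0]
region = [2.0, 1.0, 5.0]

w_c = generate_classifier(W_c, phrase)
score = region_score(w_c, region)
```

Because the classifier weights depend only on the phrase, they can be generated once per phrase and reused across every region proposal in an image.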

Negative Phrase Augmentation (NPA). An important component of [hinamiARXIV2017] is an approach for sampling negative phrases at training time in order to make the model more discriminative. For every phrase associated with an image, the idea is to find the phrases a model is most often confused with and add some of them to a minibatch as “hard negative phrases.” In practice, these hard negative phrases are obtained from a confusion table that is updated every 10K iterations during training and contains 500 hard negatives for phrases in the training set.

A big potential problem with NPA is that, since phrase grounding datasets are very sparsely labeled, many putative hard negative phrases are likely unlabeled positive examples. Hinami and Satoh [hinamiARXIV2017] proposed two ways of addressing this issue. The first method is to use WordNet [Miller:1995:WLD:219717.219748] to identify hypernym relationships and remove phrases with a parent-child relationship (e.g., a person couldn’t be a hard negative for a man). However, many phrases that refer to the same object could still pass this test, e.g., a woman could still be considered a hard negative phrase for a skier. This led Hinami and Satoh to propose using the dataset annotations to identify these mutually non-exclusive phrases. If two phrases were often annotated as referring to the same object, then they also could not be used as hard negatives for each other. While this could work for common phrases, to which [hinamiARXIV2017] restricts its evaluation, in our unrestricted phrase detection scenario, many phrases occur very few times, which means NPA would mostly rely on WordNet to identify false negative phrases. We manually inspected the entries of 30 randomly selected phrases in the confusion table when trying to train NPA using our best phrase detection model, and found that 21 had obvious false negatives within the top few most confused phrases. Thus, we expect the phrase confusion table to be quite unreliable. Accordingly, in our experiments reported in Section 6, using NPA leads to negligible performance differences on phrase detection, while also increasing training time.

3.3.2 Canonical Correlation Analysis (CCA)

CCA [hotelling1936relations] is a classical method often used to benchmark vision-language tasks (including for phrase localization [flickrentitiesijcv]). The goal of CCA is to learn linear transformations $U, W$ between two sets of paired variables $X, Y$ (in our case, representing the image region and text features) that maximize the correlations between them, i.e.,

$$\max_{W,U} \; \mathrm{tr}(W^T X^T Y U) \quad \text{subject to: } W^T X^T X W = U^T Y^T Y U = I. \tag{2}$$

CCA can be solved as a generalized eigenvalue decomposition problem, where the eigenvectors of the top eigenvalues are concatenated to form the projection matrices. In our experiments we use normalized CCA [gong14], which scales the learned projection matrices by the eigenvalues and is well known to give better performance than regular CCA. At test time, we apply the learned CCA transformations to region and phrase features, scale these projected features, and then use cosine similarity to score regions and phrases. By the very definition of CCA, only positive region-phrase pairs are used when learning the projection parameters, and typically the entire dataset is used in a single batch. Despite CCA’s simplicity, our experiments will show that it is robust to the long-tailed distribution of phrases in existing datasets, making it a strong baseline for phrase detection.
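For readers who want to see the mechanics, the following NumPy sketch fits a linear CCA on paired features and scores a region-phrase pair with eigenvalue-scaled cosine similarity. It is one common way to solve the problem (whitening each view, then taking an SVD of the whitened cross-covariance) and is not the authors' implementation; mean-centering, the small ridge term `reg`, and the scaling exponent `power` are assumptions introduced for this example.

```python
import numpy as np


def inv_sqrt(mat):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs / np.sqrt(vals)) @ vecs.T


def fit_cca(X, Y, k, reg=1e-6):
    """Fit k-dimensional CCA on row-paired samples X (n x dx), Y (n x dy)."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    cxx = Xc.T @ Xc + reg * np.eye(X.shape[1])  # ridge term for stability
    cyy = Yc.T @ Yc + reg * np.eye(Y.shape[1])
    cxy = Xc.T @ Yc
    wx, wy = inv_sqrt(cxx), inv_sqrt(cyy)
    # Singular values of the whitened cross-covariance are the correlations.
    A, s, Bt = np.linalg.svd(wx @ cxy @ wy)
    W = wx @ A[:, :k]       # projection for X (regions)
    U = wy @ Bt.T[:, :k]    # projection for Y (phrases)
    return W, U, s[:k], mu_x, mu_y


def score(region, phrase, model, power=1.0):
    """Cosine similarity of projections scaled by the correlations."""
    W, U, s, mu_x, mu_y = model
    a = ((region - mu_x) @ W) * s ** power
    b = ((phrase - mu_y) @ U) * s ** power
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Note that the fit consumes every paired sample at once, which reflects the point made above that CCA sees the entire dataset in a single batch.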

3.3.3 Deep CCA

Deep CCA [deepcca] is intended to address two deficiencies of traditional CCA, namely, that it is a linear projection method, and that its objective cannot be back-propagated to the feature representation. Deep CCA uses a correlation loss to train the feature representation before using CCA to learn the final transformation. At training time, a singular value decomposition is computed over features in a minibatch to form an approximation of the data covariance matrix. This requires a relatively large minibatch that must also increase with the dimensionality of the desired embedding. The resulting GPU memory requirements make it difficult to train an embedding of the same dimensionality as that of linear CCA. Ultimately, we found that linear CCA outperforms Deep CCA even when they have the same output dimensions. More specifically, in our experiments, we needed a batch size of 30K to keep the loss numerically stable when learning a feature embedding of 1,024 dimensions. Thus, we kept the underlying CNN fixed when learning this feature embedding. Unlike Andrew et al. [deepcca], who used three fully connected layers to learn their feature representation, we got the best performance with a single FC layer.

3.3.4 Embedding Network

The Embedding Network [wangTwoBranch2017] is a fairly general way to fuse image and text features for cross-modal retrieval and classification tasks. Our implementation (refer back to Figure 2 for an illustration) consists of two fully connected layers each for the region and phrase features, projecting them into a shared embedding space. The projections are trained with a triplet loss. Given some query phrase $q$, a positive region $r_p$ and negative region $r_n$, our loss is

$$L(q, r_p, r_n) = \max(0,\; m + d(q, r_p) - d(q, r_n)),$$

where $d$ is the Euclidean distance and $m$ is a scalar parameter representing a minimum margin between positive and negative pairs. Given some phrase $q$, a positive region during training $r_p$ is defined as a region with at least 0.6 IOU with the ground truth. After the model is trained, we define the confidence score for a region and a phrase as the distance between the L2-normalized embedded vectors. To get the best performance, in addition to the cross-modal triplet loss above, we also include within-modality terms as in [wangTwoBranch2017], imposed on triplets of regions and phrases, respectively.
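A standard margin-based triplet loss consistent with this description (Euclidean distance $d$, margin $m$) can be sketched as follows. The embeddings are assumed to be already projected into the shared space, only the cross-modal term is shown, and the helper names and toy vectors are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(query, region_pos, region_neg, margin):
    """Hinge on the gap between positive and negative distances."""
    return max(0.0, margin
               + euclidean(query, region_pos)
               - euclidean(query, region_neg))


q = l2_normalize([3.0, 0.0])
r_pos = l2_normalize([2.0, 0.0])
r_neg = l2_normalize([0.0, 5.0])

satisfied = triplet_loss(q, r_pos, r_neg, margin=0.1)
violated = triplet_loss(q, r_neg, r_pos, margin=0.1)
```

The loss is zero once the positive region is closer to the phrase than the negative region by at least the margin, and grows linearly with the size of the violation otherwise.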

3.3.5 Conditional Image-Text Embedding Network (CITE)

Phrases can describe a wide array of objects and can include attributes or modifiers, all of which need to be effectively represented (e.g., finding a blond woman would be considered a false positive if the query described a brunette woman). The CITE network [plummerCITE2017] (Figure 2, right) attempts to make the vision-language representation more expressive by training a set of embeddings that can specialize in identifying useful concepts for finding the phrase. The weights over these embeddings are computed from individual phrases using a trained soft attention mechanism. The final N-dimensional representation is a weighted sum over the set of N-dimensional conditional embeddings, and is then fed into a classifier implemented as a fully connected layer. This network is trained using a logistic loss with L1 regularization on the conditional weights. Let $K$ be the number of image region-query pairs in a batch, $c$ be the confidence in a region-query pair, $l$ be its -1/1 label indicating whether it is a negative/positive pair, and $a$ be the conditional embedding weights before the softmax. Then our loss is:

$$L_{cls} = \sum_{i=1}^{K} \log(1 + \exp(-l_i c_i)) + \lambda_1 \lVert a_i \rVert_1, \tag{3}$$

where $\lambda_1$ is a scalar parameter. We keep all hyperparameters and training procedures the same as in Plummer et al. [plummerCITE2017], except for our minibatch construction. Our minibatches are built on a per-image basis, so each time an image appears in a minibatch, it is also matched to all its related phrases. Plummer et al. sampled image-phrase pairs, so two phrases matched to the same image could appear in different minibatches. While our minibatch construction does reduce training time significantly since each image is processed only once in each epoch, we found this hurts localization performance by about 1%.
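The logistic loss with an L1 penalty on the conditional weights can be written compactly. In the sketch below the penalty is applied per region-query pair inside the sum, which is an assumption about how the regularizer is aggregated; the function name and toy inputs are also illustrative.

```python
import math


def cite_loss(confidences, labels, cond_weights, lam):
    """Logistic loss on -1/+1 labels plus an L1 penalty on attention weights.

    confidences:  classifier output c_i for each region-query pair
    labels:       l_i in {-1, +1}
    cond_weights: pre-softmax conditional embedding weights a_i per pair
    lam:          regularization strength
    """
    total = 0.0
    for c, l, a in zip(confidences, labels, cond_weights):
        total += math.log1p(math.exp(-l * c))
        total += lam * sum(abs(x) for x in a)
    return total


loss = cite_loss(
    confidences=[0.0, 0.0],
    labels=[1, -1],
    cond_weights=[[1.0, -2.0], [0.0, 0.5]],
    lam=0.1,
)
```

The L1 term pushes the pre-softmax weights toward sparsity, which encourages each phrase to rely on a small number of the conditional embeddings.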

4 Addressing Challenges of Open-Vocabulary Detection

The most significant difference between phrase detection and the classical task of object detection is the need to support an open vocabulary. In this section we discuss two of the associated challenges and how we address them. Section 4.1 describes our CCA initialization procedure, which helps learn a more discriminative representation for the long-tailed distribution of phrases, and Section 4.2 describes how we handle label sparsity.

4.1 Initial Vision-Language Alignment

In our experiments, we obtain the best results with Embedding Networks and CITE when we initialize their projection layers using all available data, instead of learning strictly from minibatches containing a fraction of the training phrases at a time.

We assume we have two views (i.e., phrase and region features) that we would like to project using a pair of fully connected (FC) layers (one for each view) so they share a single semantic space. Network layers are initialized recursively, starting with the lowest layer. This ensures that the input views to each layer contain a reasonable representation, obtained either from some pre-training procedure or from layers that were previously initialized using our approach. Each layer h(x) (here, x can refer either to region or text features) has the following form:

$$h(x) = W(x - \mu)\,\sigma + b. \qquad (4)$$

We estimate the layer’s parameters W, μ, σ using normalized CCA [gong14], and initialize b with zeros. The CCA objective, whose goal is to maximize the correlations between the two input views, can be solved using a generalized eigenvalue decomposition. The eigenvectors for the top K eigenvalues σ are concatenated to form W, μ is the mean of the input features x estimated over the entire training set, and σ is used to scale the output features, which has been shown to improve performance. During training we only update the parameters W, b and keep μ, σ fixed. Letting the network update W, b arbitrarily, however, may result in catastrophic forgetting. To avoid this issue, we use L1 regularization on b (since it is zero-initialized) and L2 regularization between the initial CCA-estimated projection W_cca and the updated parameters W, i.e.,

$$L_{reg} = \lambda_2 \lVert W_{cca} - W \rVert_2 + \lVert b \rVert_1, \qquad (5)$$

where λ2 is a scalar parameter that is set via grid search using validation data. We also experimented with an iterative procedure where we alternated between estimating an FC layer’s weights with CCA and fine-tuning them with the entire network, but found a single iteration was sufficient.

For CITE and Embedding Network, as illustrated in Figure 2 (right), the above initialization procedure is applied to Text Layer 1 and 2 and the corresponding Region Layer 1 and 2, and any other layers are randomly initialized.
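To make the roles of the fixed and trainable parameters concrete, the following is a minimal sketch of the layer in Eq. (4) and the regularizer in Eq. (5). It rests on several assumptions: σ is applied as a per-output-dimension scale, the L2 term is the Frobenius norm of the difference between the CCA-estimated and current weights, and the names `layer_forward` and `reg_loss` are hypothetical. Estimating W, μ, σ with CCA itself is not shown.

```python
import math


def layer_forward(W, mu, sigma, b, x):
    """h(x) = W (x - mu) * sigma + b, with sigma scaling each output dimension."""
    centered = [xi - mi for xi, mi in zip(x, mu)]
    out = []
    for row, scale, bias in zip(W, sigma, b):
        projected = sum(w * c for w, c in zip(row, centered))
        out.append(projected * scale + bias)
    return out


def reg_loss(W_init, W, b, lambda2):
    """lambda2 * ||W_init - W||_2 + ||b||_1 (Frobenius norm assumed for the L2 term)."""
    squared = sum(
        (w0 - w) ** 2
        for row0, row in zip(W_init, W)
        for w0, w in zip(row0, row)
    )
    return lambda2 * math.sqrt(squared) + sum(abs(v) for v in b)
```

At initialization W equals the CCA estimate and b is zero, so the penalty is zero; it grows only as fine-tuning moves the layer away from the CCA solution.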

4.2 Positive Phrase Augmentation

Phrase grounding datasets are sparsely annotated, i.e., many regions that could conceivably be described by some phrase lack that annotation – either because no human annotator chose to describe that particular region, or described it using a different phrase. In phrase localization this is not a critical issue, since performance is evaluated only on images already known to contain at least one instance of a phrase. In phrase detection, however, the goal is to consistently score all instances of a phrase across the entire dataset, making it much more likely that a high-scoring negative region is simply an unlabeled positive. To help mitigate this issue, we propose a positive phrase augmentation procedure that can be used both at training time to learn a better model, and at test time to improve the accuracy of our metrics.

We begin by collecting a set of word replacements for the words in a phrase, using WordNet to identify synonyms as well as more general forms of each word via its hypernym relationships. Then, we use this set of related words as replacements for their associated words to construct additional positive labels for a phrase. For example, for the phrase blue jacket we obtain word replacements coat, cover, and apparel for the word jacket. This results in candidate positive phrases blue coat, blue cover, and blue apparel.

Additional care must be taken based on how the datasets are collected. In the ReferIt dataset phrases are meant to uniquely describe an object in an image. As a result, many phrases contain references to spatial relationships with other entities (e.g., the box to the right of the table). Thus, on the ReferIt dataset we avoid replacing words like left and right. For Flickr30K Entities, however, the annotations typically refer to single entities. This means we can consider individual words in a phrase and all combinations of them as candidates. For example, for the phrase a large red house, we can consider a large house, a red house, house, and red as positive examples. On ReferIt, however, we cannot do this because breaking up a phrase like the dog on a table would incorrectly yield both dog and table as positives. After obtaining all candidate phrases associated with an image region, we only add phrases that already exist in that dataset split. This helps to filter out some potential false positives and odd phrase constructions.
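The augmentation steps above can be sketched as follows. This is only an illustration under stated assumptions: the replacement table `REPLACEMENTS` is hand-written here as a stand-in for the WordNet lookups, `PROTECTED` holds the spatial words that are not replaced on ReferIt, and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical stand-in for synonyms/hypernyms that would come from WordNet.
REPLACEMENTS = {"jacket": ["coat", "cover", "apparel"]}
# Spatial words that should not be replaced (ReferIt-style phrases).
PROTECTED = {"left", "right"}


def replacement_candidates(phrase, replacements=REPLACEMENTS, protected=PROTECTED):
    """Swap one word at a time for each of its related words."""
    words = phrase.split()
    candidates = set()
    for i, word in enumerate(words):
        if word in protected:
            continue
        for related in replacements.get(word, []):
            candidates.add(" ".join(words[:i] + [related] + words[i + 1:]))
    return candidates


def subphrase_candidates(phrase):
    """All order-preserving proper subsets of words (Flickr30K Entities only)."""
    words = phrase.split()
    candidates = set()
    for size in range(1, len(words)):
        for idx in combinations(range(len(words)), size):
            candidates.add(" ".join(words[i] for i in idx))
    return candidates


def augment(phrase, vocabulary, use_subphrases):
    """Keep only candidates that already exist in the dataset split."""
    candidates = replacement_candidates(phrase)
    if use_subphrases:
        candidates |= subphrase_candidates(phrase)
    return sorted(c for c in candidates if c in vocabulary and c != phrase)
```

Setting `use_subphrases=False` corresponds to the ReferIt case, where splitting a phrase such as the dog on a table would produce incorrect positives. The final vocabulary filter is what removes odd constructions such as a lone article.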

Although NPA and PPA both use WordNet to identify words that may be related to a ground truth phrase, PPA uses these related words to add likely positives to an image’s annotations. NPA, by contrast, uses these related words as a filtering step to remove false negatives. Thus, the phrases that are actually added to a minibatch (i.e., the hard negatives) are simply those which the model gets confused about, but, in practice, are often false negatives that could not be identified easily.

4.3 Inverse Frequency Sampling

While every child is a person, not every person is a child, making phrases containing generic words far more common in phrase grounding datasets. Thus, when a model sees a child during training it can simply use features that identify a person, which is seen more often. As a result, every word referring to a specific subgroup within person may use the same features, making it difficult to distinguish between them. One possible solution is to include hard negative phrases during training. However, as we discussed in Section 3.3.1, obtaining hard negatives automatically is non-trivial and very prone to errors. Instead, we bias our phrase sampling procedure to prefer rare phrases that are often more specific than common phrases. In this way, we can indirectly encourage our model to learn fine-grained differences.

Concretely, our inverse frequency sampling procedure (IFS) samples phrases under a sampling budget K, with probabilities given by the inverse of their relative likelihood in the training set. For each image, we obtain the set of phrases and the proportion of times each phrase is labeled in the dataset (i.e., if a phrase occurs 5 times in the dataset out of 20 total phrase-region pairs, then it would account for 25% of instances). Then, we renormalize the likelihoods of all the phrases in the image so that they sum to 1 and take their inverse. For example, if a dog accounts for 15% of instances while a terrier accounts for 5% and these are the only two phrases associated with an image, then we would sample a terrier with likelihood 1-(0.05/(0.05+0.15))=75% and a dog 25% of the time. We ensure a phrase representing each entity in an image is included by automatically selecting all ground truth phrases, which, by definition, are also the most specific reference to the entity (i.e., we only subsample augmented phrases from Section 4.2). In our experiments we found a sampling budget of K=30 to work best.
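A minimal sketch of this sampling procedure is given below. Two details are assumptions rather than statements from the text: with more than two phrases the inverted likelihoods no longer sum to 1, so the sketch renormalizes them again, and the budget is taken to count the ground truth phrases as well as the sampled augmented ones. The function names are hypothetical.

```python
import random


def ifs_probabilities(freqs):
    """Map each phrase's dataset frequency to an inverse-frequency sampling probability.

    freqs: dict of phrase -> fraction of dataset instances, for one image's phrases.
    """
    total = sum(freqs.values())
    if len(freqs) == 1 or total == 0:
        return {p: 1.0 / len(freqs) for p in freqs}
    # Renormalize within the image, then invert (1 - likelihood).
    inverse = {p: 1.0 - f / total for p, f in freqs.items()}
    norm = sum(inverse.values())  # equals 1 for two phrases; assumed step otherwise
    return {p: v / norm for p, v in inverse.items()}


def sample_phrases(ground_truth, augmented_freqs, budget, rng=random):
    """Keep every ground truth phrase, then fill the budget from augmented phrases."""
    chosen = list(ground_truth)
    remaining = dict(augmented_freqs)
    while remaining and len(chosen) < budget:
        probs = ifs_probabilities(remaining)
        phrases = list(probs)
        pick = rng.choices(phrases, weights=[probs[p] for p in phrases], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen
```

For the two-phrase case in the text, `ifs_probabilities({"a dog": 0.15, "a terrier": 0.05})` assigns 0.75 to a terrier and 0.25 to a dog.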

5 Phrase Localization Experiments

We begin by evaluating the models of Figure 2 on the established task of phrase localization before moving on to phrase detection in Section 6. As explained before, localization assumes we are provided a ground truth image-phrase pair, and the goal is to find the bounding box for the phrase within the image. A localization is deemed successful if the predicted box has at least 0.5 IOU with the ground truth.
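The success criterion can be illustrated with a short sketch. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates and that there is one predicted box per image-phrase pair; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of image-phrase pairs whose predicted box reaches the IOU threshold."""
    hits = sum(
        1 for pred, gt in zip(predictions, ground_truths) if iou(pred, gt) >= threshold
    )
    return hits / len(predictions)
```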

Datasets. We use two common phrase grounding datasets in our experiments. Our first dataset is Flickr30K Entities [flickrentitiesijcv], which consists of 276K bounding boxes in 32K images for the noun phrases associated with each image’s descriptive captions (5 per image) from the Flickr30K dataset [young2014image]. We use the splits of Plummer et al. [flickrentitiesijcv] that consist of 30K/1K/1K train/test/validation images. Our second dataset is ReferIt [kazemzadeh-EtAl:2014:EMNLP2014], which consists of 20K images from the IAPR TC-12 dataset [Grubinger06theiapr] that have been augmented with 120K region descriptions. We use the splits of Hu et al. [hu2015natural], which split the train/val and testing sets evenly (i.e., 10K each). See Table IV for statistics on the number of phrases and instances in both datasets.

TABLE I: Phrase localization performance on the Flickr30k Entities test set. (a) State-of-the-art taken from prior work, (b) our compared approaches that use the same RPN and feature representation varying only the trained region classifier, and (c) benefits provided from initializing our classification layers using CCA.
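| Group | Method | Accuracy |
|---|---|---|
| (a) State-of-the-art (VGG) | CITE (w/o RPN) [plummerCITE2017] | 61.89 |
| (a) State-of-the-art (VGG) | PGN + QRN [ChenICCV2017] | 60.21 |
| (a) State-of-the-art (VGG) | SeqGROUND (ResNet-50) [Dogan_2019_CVPR] | 61.60 |
| (a) State-of-the-art (VGG) | QA R-CNN [hinamiARXIV2017] | 62.52 |
| (a) State-of-the-art (VGG) | QA R-CNN + VG + OHEM [hinamiARXIV2017] | 65.21 |
| (a) State-of-the-art (VGG) | QA R-CNN + VG + OHEM + NPA [hinamiARXIV2017] | 64.09 |
| (a) State-of-the-art (VGG) | QRC [ChenICCV2017] | 65.14 |
| (a) State-of-the-art (VGG) | CITE (our implementation) | 66.78 |
| (b) Region Classification Method (ResNet) | QA R-CNN | 71.11 |
| (b) Region Classification Method (ResNet) | QA R-CNN + NPA | 69.73 |
| (b) Region Classification Method (ResNet) | CCA | 67.05 |
| (b) Region Classification Method (ResNet) | Deep CCA | 65.92 |
| (b) Region Classification Method (ResNet) | Embedding Network | 70.88 |
| (b) Region Classification Method (ResNet) | CITE (our implementation) | 71.52 |
| (c) w/CCA Initialization (ResNet) | Embedding Network | 71.36 |
| (c) w/CCA Initialization (ResNet) | CITE (our implementation) | 71.70 |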

5.1 Localization Results

Tables I and II present results on the Flickr30K Entities and ReferIt datasets, respectively. Parts (a) of both tables report numbers from prior work using a VGG backbone [simonyan2014very]. The last line of each subtable gives the performance of our implementation of the CITE network using RPN and BBReg. We can see that CITE obtains similar performance to other adaptations of the Faster R-CNN network [hinamiARXIV2017, ChenICCV2017]. Unlike those works, we do not need to train on outside vision-language datasets (i.e. Visual Genome (VG) [krishnavisualgenome]), perform online hard example mining (OHEM) [shrivastavaCVPR16ohem], or jointly predict multiple phrases in the same image at once [ChenICCV2017]. Thus, we can regard CITE as a good representative of the state-of-the-art in phrase localization.

Tables I(b) and II(b) present a comparison of different region classifiers on top of a ResNet backbone. First, we observe that adding NPA to QA R-CNN (discussed in Section 3.3.1) slightly decreases performance, which is consistent with the results in [hinamiARXIV2017]. NPA is designed to improve discrimination of similar phrases, which is largely unnecessary in localization since similar phrases rarely occur in the same image as we argued earlier. Second, Deep CCA performs worse than regular CCA. As discussed in Section 3.3.3, we conjecture that even a batch size of 30K is insufficient to get the CCA objective to generalize well (for comparison, for Flickr30K Entities we use 420K samples when training regular CCA). In addition, since Deep CCA requires a large batch size for training, fine-tuning the entire network is impractical (in this experiment all layers except those for the classifier are kept fixed). The last two lines of Tables I(b) and II(b) show the results of Embedding Network and CITE with random initialization. We can see that CITE outperforms the Embedding Network and is competitive with QA R-CNN.

Tables I(c) and II(c) report the performance of Embedding Network and CITE with CCA initialization. This initialization does not make much difference for localization (performance improves slightly on Flickr30K Entities and decreases on ReferIt). However, we argue that the underlying model can better discriminate between closely related phrases, the effect of which will be seen through consistent, larger increases in detection performance in Section 6.

TABLE II: Phrase localization performance on the ReferIt test set. (a) Published results of methods that fine-tune the visual representation, except for CITE which does no fine-tuning, (b) our compared approaches that use the same RPN and feature representation varying only the trained region classifier, and (c) benefits provided from initializing our classification layers using CCA.
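| Group | Method | Accuracy |
|---|---|---|
| (a) State of the art (VGG) | CITE (w/o RPN, w/o finetuning) [plummerCITE2017] | 34.13 |
| (a) State of the art (VGG) | QRC [ChenICCV2017] | 44.07 |
| (a) State of the art (VGG) | APML [Li:2017:DAM:3123266.3123439] | 44.18 |
| (a) State of the art (VGG) | CITE (our implementation) | 47.98 |
| (b) Region Classification Method (ResNet) | QA R-CNN | 55.81 |
| (b) Region Classification Method (ResNet) | QA R-CNN + NPA | 54.67 |
| (b) Region Classification Method (ResNet) | CCA | 49.53 |
| (b) Region Classification Method (ResNet) | Deep CCA | 46.76 |
| (b) Region Classification Method (ResNet) | Embedding Network | 52.27 |
| (b) Region Classification Method (ResNet) | CITE (our implementation) | 54.04 |
| (c) w/CCA Initialization (ResNet) | Embedding Network | 51.50 |
| (c) w/CCA Initialization (ResNet) | CITE (our implementation) | 53.19 |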

Table III benchmarks CCA-initialized methods on new test sets created by augmenting positive phrases (discussed in Section 4.2), which helps reduce annotation sparsity. The relative performance of methods remains unchanged, but the drop in absolute performance suggests the standard benchmark may overestimate a model’s true performance since many of the alternative ways of referencing the same phrase result in phrases not being correctly localized, especially on Flickr30K Entities. The last two lines of Table III(b) compare randomly subsampling the augmented positive phrases during training (RS) with the IFS sampling method described in Section 4.3. Although IFS gives better performance than RS, it does result in a slight drop in performance compared with using all phrases. However, as we discuss in Section 4.3, this is due to overfitting to phrase localization when using all phrases, making it less able to identify if a phrase is relevant to an image. This will be clearly shown when evaluating phrase detection in the next section.

TABLE III: Phrase localization performance using augmented positive phrases (PPA) discussed in Section 4.2 for evaluation. (a) compares methods that are trained using the ground truth annotations and (b) reports the effect training with PPA has on performance. All methods use CCA as either the region classifier or for layer initialization.
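| Training | Method | Flickr30K Entities | ReferIt Game |
|---|---|---|---|
| (a) w/o Train PPA | CCA | 49.73 | 40.67 |
| (a) w/o Train PPA | Embedding Network | 59.73 | 47.92 |
| (a) w/o Train PPA | CITE | 59.20 | 47.00 |
| (b) w/Train PPA | CCA | 55.04 | 47.29 |
| (b) w/Train PPA | Embedding Network | 62.35 | 52.95 |
| (b) w/Train PPA | CITE | 61.65 | 50.03 |
| (b) w/Train PPA | CITE + RS | 61.08 | 49.75 |
| (b) w/Train PPA | CITE + IFS | 62.07 | 50.22 |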

6 Phrase Detection Experiments

TABLE IV: mAP for the phrase detection task split by frequency of training instances. (a) our compared approaches that use the same RPN and feature representation varying only the trained region classifier, and (b) benefits from initializing our classification layers using CCA.
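| Group | Method | Flickr30K Entities zero-shot | Flickr30K Entities few-shot | Flickr30K Entities common | Flickr30K Entities mean/total | ReferIt Game zero-shot | ReferIt Game few-shot | ReferIt Game common | ReferIt Game mean/total |
|---|---|---|---|---|---|---|---|---|---|
| #Train Samples Per Phrase | | 0 | 1-100 | >100 | | 0 | 1-100 | >100 | |
| (a) Region Classification Method | QA R-CNN | 3.9 | 4.3 | 8.9 | 5.7 | 0.3 | 0.7 | 11.9 | 4.3 |
| (a) Region Classification Method | QA R-CNN + NPA | 3.8 | 4.1 | 9.7 | 5.9 | 0.3 | 0.6 | 11.5 | 4.1 |
| (a) Region Classification Method | CCA | 8.8 | 10.7 | 17.1 | 12.2 | 0.6 | 1.5 | 15.0 | 5.7 |
| (a) Region Classification Method | Deep CCA | 6.4 | 7.5 | 14.9 | 9.6 | 0.3 | 1.1 | 13.2 | 4.9 |
| (a) Region Classification Method | Embedding Network | 3.1 | 4.0 | 9.1 | 5.4 | 0.1 | 0.8 | 12.0 | 4.3 |
| (a) Region Classification Method | CITE | 4.6 | 4.7 | 8.8 | 6.0 | 0.4 | 0.5 | 12.5 | 4.5 |
| (b) w/CCA Initialization | Embedding Network | 9.2 | 10.3 | 17.2 | 12.3 | 0.5 | 1.6 | 15.0 | 5.7 |
| (b) w/CCA Initialization | CITE | 9.6 | 11.5 | 17.1 | 12.7 | 0.7 | 2.0 | 15.3 | 6.0 |
| #Categories | | 1,783 | 2,764 | 472 | 5,019 | 27,378 | 4,568 | 40 | 31,986 |
| #Total Test Occurrences | | 1,860 | 4,373 | 8,248 | 14,481 | 29,304 | 21,850 | 14,039 | 65,193 |
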
TABLE V: mAP for the phrase detection task split by frequency of training instances where augmented positive phrases (PPA) discussed in Section 4.2 are used for evaluation. (a) compares methods that are trained only using the ground truth annotations and (b) reports the effect also training with PPA has on performance. All methods use CCA as either the region classifier or for layer initialization.
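| Training | Method | Flickr30K Entities zero-shot | Flickr30K Entities few-shot | Flickr30K Entities common | Flickr30K Entities mean/total | ReferIt Game zero-shot | ReferIt Game few-shot | ReferIt Game common | ReferIt Game mean/total |
|---|---|---|---|---|---|---|---|---|---|
| #Train Samples Per Phrase | | 0 | 1-100 | >100 | | 0 | 1-100 | >100 | |
| (a) w/o Train PPA | CCA | 8.9 | 10.7 | 18.9 | 12.9 | 0.6 | 1.5 | 13.8 | 5.3 |
| (a) w/o Train PPA | Embedding Network | 8.7 | 10.4 | 19.6 | 12.9 | 0.3 | 1.3 | 13.8 | 5.2 |
| (a) w/o Train PPA | CITE | 9.7 | 11.8 | 19.5 | 13.7 | 0.6 | 1.7 | 14.2 | 5.5 |
| (b) w/Train PPA | CCA | 8.4 | 10.8 | 18.8 | 12.7 | 0.4 | 1.6 | 13.4 | 5.1 |
| (b) w/Train PPA | Embedding Network | 8.3 | 10.3 | 19.8 | 12.8 | 0.4 | 1.3 | 14.9 | 5.5 |
| (b) w/Train PPA | CITE | 9.0 | 11.2 | 20.3 | 13.5 | 0.6 | 1.7 | 14.3 | 5.6 |
| (b) w/Train PPA | CITE + RS | 9.2 | 11.3 | 20.7 | 13.8 | 0.5 | 1.7 | 14.6 | 5.6 |
| (b) w/Train PPA | CITE + IFS | 9.5 | 12.0 | 21.6 | 14.4 | 0.7 | 1.8 | 15.1 | 5.9 |
| #Total Test Occurrences | | 2,679 | 27,327 | 41,274 | 71,280 | 55,587 | 56,016 | 17,743 | 129,346 |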

Phrase detection is akin to object detection where the phrases can be thought of as categories. Thus, we evaluate this task using mean average precision (mAP) over the phrases. We found that keeping a single candidate per phrase per image performed the best in our experiments – i.e., every image predicts a single location for each phrase. When reporting performance, we also separate phrases based on the number of training instances, essentially breaking up the evaluation into zero-shot, few-shot (1-100), and common phrases (>100). Only ground truth annotations are used to calculate the number of training instances (i.e. without any data augmentation). An overall score is obtained by averaging these three mAP scores. In practice this gives higher emphasis to the common phrases that typically account for the majority of instances, but have few unique phrases, when evaluating a model.
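The scoring scheme described above can be sketched as follows, assuming the per-phrase average precision values have already been computed. The function names and input dictionaries are hypothetical.

```python
def bucket_of(train_count):
    """Assign a phrase to a frequency bucket by its number of training instances."""
    if train_count == 0:
        return "zero-shot"
    if train_count <= 100:
        return "few-shot"
    return "common"


def detection_score(ap_per_phrase, train_counts):
    """Return the per-bucket mAP and the overall score (mean of the bucket mAPs).

    ap_per_phrase: dict of phrase -> average precision on the test set
    train_counts:  dict of phrase -> number of ground truth training instances
    """
    buckets = {"zero-shot": [], "few-shot": [], "common": []}
    for phrase, ap in ap_per_phrase.items():
        buckets[bucket_of(train_counts.get(phrase, 0))].append(ap)
    bucket_map = {name: sum(aps) / len(aps) for name, aps in buckets.items() if aps}
    if not bucket_map:
        return {}, 0.0
    overall = sum(bucket_map.values()) / len(bucket_map)
    return bucket_map, overall
```

Applied to the three Flickr30K Entities mAP values reported for CITE with CCA initialization in Table IV (9.6, 11.5, 17.1), the same averaging gives approximately 12.7, matching the reported mean.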

6.1 Detection Results

Table IV reports performance on phrase detection on both Flickr30K Entities and ReferIt. When comparing the different classification methods in Table IV(a), we see that cross-correlation methods (i.e., CCA and Deep CCA) significantly outperform other approaches. This stands in direct contrast with the phrase localization experiments of Section 5, where they performed the worst, and provides additional evidence of prior work overfitting to phrase localization. However, as can be seen in Table IV(b), by fine-tuning the learned CCA weights, we can not only improve performance on phrase detection further, but, as we saw in Section 5, train a model which is competitive with the state-of-the-art in phrase localization. As we discussed in Section 3.3.1, and verified in Table IV(a), using NPA mostly affords benefits to common phrases. Thus, using NPA to obtain (noisy) hard negatives during training is not as effective as CCA for improving discrimination.

To visualize the advantage provided by CCA initialization, Figure 3 compares the confusion matrices for the top 20 person phrases for the Embedding Network classifier without and with CCA initialization. This verifies pure minibatch training without CCA initialization leads to a network that makes similar predictions for similar phrases, while the CCA-initialized network has much better fine-grained discrimination ability.

Table V reports the performance of CCA-initialized classifiers when using the PPA procedure from Section 4.2 to reduce annotation sparsity. We show that the relative performance of methods remains largely unchanged when using the same training strategies used in Table IV. However, in the last line of Table V(b) we see that training with inverse frequency sampling (IFS), which biases sampling toward harder phrases as described in Section 4.3, yields a consistent improvement over using all phrases, or randomly subsampling these phrases, just as it did with phrase localization as shown in Table III.


(a) No CCA Initialization    (b) w/CCA Initialization

Fig. 3: Confusion matrix comparing the top 20 most common phrases referring to people (in order of number of instances) in the Flickr30K Entities test set for different versions of the Embedding Network classifier.
Fig. 4: Qualitative results comparing an Embedding Network classifier with and without CCA initialization. See text for discussion.

Finally, Figure 4 shows a qualitative comparison of the models without and with CCA initialization. In the leftmost example, before using CCA for layer initialization our model made similar predictions for the dog and child, while also confusing the child and the man, all of which we correctly identify using CCA-initialized layers. Being able to correctly identify similar phrases is not restricted to references of “people,” however, as seen in the middle example of Figure 4, where the model before CCA initialization makes the same prediction for the dog and large brown white cow, but gets them correct with our full model. The third example of Figure 4 also makes several correct references with the CCA-initialized model, including for the phrase someone even though that person is much less prominent than the woman. This suggests there may be some visual cues that are useful in resolving pronominal references, and that taking into account the bias of what and how entities are referenced may improve performance (e.g., human biases when writing the phrases as done in Misra et al. [MisraNoisy16]).

6.2 Filtering Phrases

At test time, evaluating our phrase detection model for the entire phrase vocabulary may be too computationally expensive, and is likely to result in many false positive detections. To mitigate these issues, we can consider a filtering step in which we first use a global image representation to predict a short list of phrases likely to be in the image, and then selectively run our phrase detection model only on those phrases. To this end, we use the two-branch image-sentence retrieval approach of Wang et al. [wangTwoBranch2017] trained on Flickr30K to retrieve the top 100 training sentences for each image in the test set. Then, for each image we extract phrases from the retrieved sentences and only run the detector models for the extracted phrases. For the text representation we use the same HGLMM features as our phrase detectors. For the whole-image representation, we use a 152-layer ResNet pretrained on ImageNet [deng2009imagenet] and averaged over 10 crops.

Table VI reports a consistent improvement from the above filtering procedure. However, a drawback of this approach is that it requires database sentences from a similar distribution of images. We also tried to generate captions using the Show and Tell [xu2015show] approach rather than retrieve them, but we found the generated captions provided low recall on the phrases in the test set, resulting in poor performance.

Retrieving sentences provides at least two constraints the phrase detection models lack. First, sentences capture some information about co-occurrences between phrases (e.g., you likely shouldn’t try to detect a hand if you don’t think an image contains a person). Secondly, these sentences give some measure of the prior probability of a phrase, i.e., we are unlikely to retrieve a phrase if it occurs once in the entire dataset unless we are relatively certain it exists. Incorporating such constraints in an end-to-end-trainable phrase detection framework is a good potential direction for future work on phrase detection.

TABLE VI: Effect the phrase filtering approach discussed in Section 6.2 has on the phrase detection task. Methods are evaluated on the Flickr30K Entities test set and include PPA for both training/testing.
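| Filtering | Method | zero-shot | few-shot | common | mean/total |
|---|---|---|---|---|---|
| #Train Occurrences Per Phrase | | 0 | 1-100 | >100 | |
| w/o Phrase Filtering | Embedding Network | 8.7 | 10.4 | 19.6 | 12.9 |
| w/o Phrase Filtering | CITE | 9.7 | 11.8 | 19.5 | 13.7 |
| w/o Phrase Filtering | CITE + IFS | 9.5 | 12.0 | 21.6 | 14.4 |
| w/Phrase Filtering | Embedding Network | 9.3 | 11.3 | 20.5 | 13.7 |
| w/Phrase Filtering | CITE | 10.9 | 13.1 | 21.1 | 15.0 |
| w/Phrase Filtering | CITE + IFS | 11.0 | 13.0 | 22.2 | 15.4 |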

7 Conclusion

We introduced the phrase detection task, which is more challenging and has a broader set of applications than the localization-only problem addressed in prior work. Our experiments show that state-of-the-art localization models tend to have difficulty inferring the presence of phrases in an image compared to seemingly simpler methods like CCA. Nevertheless, by fine-tuning a CCA-initialized model with negative samples we obtain the best results on phrase detection, while also being competitive with the state-of-the-art on phrase localization. However, our models still perform relatively poorly compared to models for tasks like object detection, indicating substantial room for improvement in future work. A significant challenge of phrase detection stems from the long tail of phrases that occur only a few times. As discussed in Section 6.2, improvement could come from jointly predicting multiple phrases at a time while also taking into account how common a phrase is. We believe improving negative sampling methods could have a significant impact on performance in future work.


This work is supported in part by DARPA and NSF awards IIS-1724237, CNS-1629700, CCF-1723379, IIS-1718221, and IIS-1563727. The authors would like to thank Karen Livescu for helpful discussions.