Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval

Abstract

Pre-trained vision and language (V&L) models have substantially improved the performance of cross-modal image-text retrieval. In general, however, V&L models have limited retrieval performance for small objects because of the rough alignment between words and the small objects in the image. In contrast, human cognition is known to be object-centric: we pay more attention to important objects, even if they are small. To bridge this gap between human cognition and the capability of V&L models, we propose a cross-modal image-text retrieval framework based on "object-aware query perturbation." The proposed method generates a key feature subspace of the detected objects and perturbs the corresponding queries using this subspace to improve object awareness in the image. The method enables object-aware cross-modal image-text retrieval while preserving the rich expressive power and retrieval performance of existing V&L models, without additional fine-tuning. Comprehensive experiments on four public datasets show that our method outperforms conventional algorithms.
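The idea of building a key feature subspace and perturbing queries with it can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not say how the subspace is built or how the perturbation is applied, so the use of an uncentered SVD, the additive update, and the names `object_key_subspace`, `perturb_queries`, `rank`, and `alpha` are all assumptions made for illustration.

```python
import numpy as np


def object_key_subspace(keys: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis of shape (d, rank) for the object keys.

    keys: (n, d) key vectors of the patches inside one detected object.
    Assumption: the subspace is spanned by the top right singular vectors
    of the (uncentered) key matrix; rank must not exceed min(n, d).
    """
    _, _, vt = np.linalg.svd(keys, full_matrices=False)
    return vt[:rank].T


def perturb_queries(queries: np.ndarray, basis: np.ndarray, alpha: float) -> np.ndarray:
    """Strengthen the component of each query that lies in the subspace.

    queries: (m, d) query vectors; basis: (d, rank) orthonormal columns.
    Assumption: the perturbation adds alpha times the projection of the
    query onto the subspace and leaves the orthogonal component unchanged.
    """
    projection = queries @ basis @ basis.T
    return queries + alpha * projection


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(6, 8))     # hypothetical object patch keys
    queries = rng.normal(size=(3, 8))  # hypothetical queries
    basis = object_key_subspace(keys, rank=2)
    print(perturb_queries(queries, basis, alpha=0.5).shape)
```

In this sketch, a query lying entirely inside the object subspace is scaled by `1 + alpha`, while a query orthogonal to it is left unchanged, so the pre-trained weights themselves are never modified.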
