Abstract
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free-form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at https://github.com/ashkamath/mdetr.