Abstract
Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. We here extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy regularizer. Our results show consistent and significant improvements of transformers over convolution-based approaches. In particular, our method outperforms the state of the art on several public benchmarks for category-level retrieval, namely Stanford Online Product, In-Shop and CUB-200. Furthermore, our experiments on ROxford and RParis also show that, in comparable settings, transformers are competitive for particular object retrieval, especially in the regime of short vector representations and low-resolution images.
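
As an illustration of the training objective described above, the following is a minimal sketch of a contrastive loss combined with a differential entropy regularizer. The abstract does not specify the exact formulation, so several details here are assumptions: the contrastive term uses cosine similarity with a margin on negative pairs, the entropy term is a nearest-neighbor (Kozachenko-Leonenko style) estimate, and the function names and the values of `margin` and `lam` are hypothetical placeholders rather than the authors' settings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(z, labels, margin=0.5):
    """Pull same-label pairs together, push different-label pairs
    apart until their similarity drops below `margin` (assumed form)."""
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sim = dot(z[i], z[j])
            if labels[i] == labels[j]:
                total += 1.0 - sim              # positive pair
            else:
                total += max(0.0, sim - margin)  # negative pair
    return total / n


def entropy_regularizer(z, eps=1e-8):
    """Negative mean log-distance to the nearest neighbor in the batch.
    Minimizing it spreads descriptors apart (assumed estimator)."""
    n = len(z)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(z[i], z[j]) for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n


def retrieval_loss(embeddings, labels, margin=0.5, lam=0.7):
    """Combined objective: contrastive term plus weighted regularizer.
    `lam` is a hypothetical weight, not a value from the paper."""
    z = [l2_normalize(v) for v in embeddings]
    return contrastive_loss(z, labels, margin) + lam * entropy_regularizer(z)
```

In this sketch, the regularizer takes a lower value when descriptors in a batch are well separated and a higher value when two descriptors nearly coincide, which is the intuition behind using it to spread representations more uniformly over the unit sphere. A practical implementation would operate on batched tensors in a deep learning framework rather than on Python lists.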