Investigating the Vision Transformer Model for Image Retrieval Tasks

Abstract

This paper introduces a plug-and-play descriptor that can be effectively adopted for image retrieval tasks without prior initialization or preparation. The description method utilizes the recently proposed Vision Transformer network and requires no training data to adjust its parameters. In image retrieval tasks, handcrafted global and local descriptors have, in recent years, been very successfully replaced by Convolutional Neural Network (CNN)-based methods. However, the experimental evaluation conducted in this paper on several benchmarking datasets against 36 state-of-the-art descriptors from the literature demonstrates that a neural network that contains no convolutional layer, such as the Vision Transformer, can shape a global descriptor and achieve competitive results. As fine-tuning is not required, the low complexity of the presented methodology encourages adoption of the architecture as an image retrieval baseline model, replacing the traditional and well-adopted CNN-based approaches and inaugurating a new era in image retrieval approaches.
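
To make the idea of retrieval with a global descriptor concrete, the following is a minimal sketch of the ranking step only. It assumes that one fixed-length descriptor per image has already been extracted (for example, from a pretrained Vision Transformer) and that images are compared by cosine similarity. The abstract specifies neither the extraction details nor the similarity measure, so both are assumptions here; the function names and the toy three-dimensional vectors are hypothetical and purely illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_by_cosine(query, database):
    """Return database indices sorted from most to least similar to the query.

    Assumption: cosine similarity is the comparison measure; the abstract
    does not state which measure the paper actually uses.
    """
    q = l2_normalize(query)
    scores = []
    for idx, desc in enumerate(database):
        d = l2_normalize(desc)
        similarity = sum(a * b for a, b in zip(q, d))
        scores.append((similarity, idx))
    # Highest similarity first; ties are broken by the lower index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores]


# Toy stand-ins for global descriptors (hypothetical values, not real features).
database = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
query = [1.0, 0.05, 0.0]

ranking = rank_by_cosine(query, database)
print(ranking)
```

Because no parameters are trained or adjusted, this comparison can in principle be run directly on descriptors from an off-the-shelf model, which is the sense in which the abstract calls the descriptor plug-and-play.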
