Language-only Efficient Training of Zero-shot Composed Image Retrieval

Abstract

The composed image retrieval (CIR) task takes a composed query of an image and text, aiming to retrieve images relevant to both conditions. Conventional CIR approaches need a training dataset composed of triplets of query image, query text, and target image, which is very expensive to collect. Several recent works have explored the zero-shot (ZS) CIR paradigm to tackle this issue without using pre-collected triplets. However, the existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework that uses only language for its training. Our LinCIR (Language-only training for CIR) can be trained with text datasets alone, using a novel self-supervision named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the keyword tokens of the original text. Then, we let the new and original texts have the same latent embedding vector. With this simple strategy, LinCIR is surprisingly efficient and highly effective; LinCIR with a CLIP ViT-G backbone is trained in 48 minutes and shows the best ZS-CIR performance on four different CIR benchmarks, CIRCO, GeneCIS, FashionIQ, and CIRR, even outperforming the supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir
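
The self-masking projection step described above can be illustrated with a toy sketch. This is not the released LinCIR code: the mean-pooling linear encoder, the linear projection `W_PHI`, the keyword mask, and the mean-squared-error objective are all assumptions chosen to keep the example small and dependency-free. It only shows the data flow: encode the text, project the latent into token space, overwrite the keyword tokens with that projection, re-encode, and compare the two latents.

```python
import random

random.seed(0)
DIM = 4  # toy width shared by the token and latent embedding spaces


def rand_matrix(rows, cols):
    """Random Gaussian matrix as a list of rows."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    """Row vector times matrix."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]


# Hypothetical stand-ins: a real setup would use a frozen pretrained text encoder.
W_ENC = rand_matrix(DIM, DIM)  # frozen toy encoder weights
W_PHI = rand_matrix(DIM, DIM)  # projection that training would update


def encode(token_embs):
    """Toy text encoder: mean-pool the token embeddings, then apply a linear map."""
    pooled = [sum(col) / len(token_embs) for col in zip(*token_embs)]
    return matvec(pooled, W_ENC)


def self_masking_loss(token_embs, keyword_mask):
    """Assumed MSE between the latents of the original and the self-masked text."""
    z = encode(token_embs)           # latent embedding of the original text
    pseudo_token = matvec(z, W_PHI)  # latent projected into token embedding space
    # Build the new text: keyword positions are replaced by the projected latent.
    masked = [pseudo_token if is_kw else tok
              for tok, is_kw in zip(token_embs, keyword_mask)]
    z_masked = encode(masked)        # latent embedding of the new text
    return sum((a - b) ** 2 for a, b in zip(z, z_masked)) / DIM


tokens = rand_matrix(5, DIM)                   # five toy token embeddings
keywords = [False, True, False, True, False]   # hypothetical keyword positions
loss = self_masking_loss(tokens, keywords)
no_mask_loss = self_masking_loss(tokens, [False] * 5)
```

With no keywords masked, the new text equals the original, so the loss is exactly zero; with keywords masked, the loss is positive until the projection learns to carry the information of the replaced tokens. In an actual training loop only the projection would receive gradient updates, which this sketch omits.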
