Abstract
Product retrieval is of great importance in the e-commerce domain. This paper introduces our 1st-place solution to the eBay eProduct Visual Search Challenge (FGVC9), which features an ensemble of about 20 models drawn from vision models and vision-language models. While model ensembling is common, we show that combining vision models with vision-language models brings particular benefits from their complementarity and is a key factor in our superiority. Specifically, for the vision models, we use a two-stage training pipeline that first learns from the coarse labels provided in the training set and then conducts fine-grained self-supervised training, yielding a coarse-to-fine metric learning scheme. For the vision-language models, we use the textual descriptions of the training images as supervision signals for fine-tuning the image encoder (feature extractor). With these designs, our solution achieves 0.7623 MAR@10, ranking first among all competitors. The code is available at: \href{https://github.com/WangWenhao0716/V2L}{V$^2$L}.
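To make the retrieval setting concrete, the following is a minimal sketch of feature-level ensembling followed by top-k retrieval and a recall-at-k score. It is illustrative only and rests on assumptions not stated in the abstract: that the ensemble concatenates L2-normalized embeddings from each model, that retrieval uses cosine similarity, and that MAR@10 denotes macro-averaged recall at 10 (hits in the top 10 divided by the number of relevant index items, averaged over queries). All function names are hypothetical and are not taken from the released code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def ensemble_features(feature_list):
    """Assumed ensembling rule: normalize each model's features,
    concatenate along the feature axis, then renormalize."""
    parts = [l2_normalize(np.asarray(f, dtype=np.float64)) for f in feature_list]
    return l2_normalize(np.concatenate(parts, axis=1))


def top_k(query_feats, index_feats, k=10):
    """Return indices of the k most similar index items per query
    (cosine similarity, since both inputs are unit-normalized)."""
    sims = query_feats @ index_feats.T
    return np.argsort(-sims, axis=1, kind="stable")[:, :k]


def macro_average_recall_at_k(ranked, relevant, k=10):
    """Assumed metric: per-query recall@k averaged over all queries.
    `relevant` is a list of sets of ground-truth index ids."""
    recalls = []
    for row, rel in zip(ranked, relevant):
        if not rel:
            continue  # skip queries without ground truth
        hits = len(set(row[:k].tolist()) & rel)
        recalls.append(hits / len(rel))
    return float(np.mean(recalls))


if __name__ == "__main__":
    # Toy data: one "vision" and one "vision-language" feature set.
    index_v = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    index_vl = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    query_v = np.array([[1.0, 0.0]])
    query_vl = np.array([[1.0, 0.0]])

    index = ensemble_features([index_v, index_vl])
    query = ensemble_features([query_v, query_vl])
    ranked = top_k(query, index, k=1)
    print(ranked, macro_average_recall_at_k(ranked, [{0}], k=1))
```

Concatenation is only one possible way to combine models; averaging similarity scores across models is a common alternative, and the abstract does not specify which is used.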