Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis

Abstract

Fashion retail depends on the ability to understand products. Product attribution, the assignment of structured attributes to each product, supports that understanding across business processes. High-quality attribution improves the customer experience as shoppers navigate the millions of products offered by a retail website, and it leads to well-organized product catalogs. Ultimately, product attribution directly affects the customer's 'discovery experience'. Although large language models (LLMs) have shown remarkable capabilities in understanding multimodal data, their performance on fine-grained fashion attribute recognition remains under-explored. This paper presents a zero-shot evaluation of state-of-the-art LLMs that balance performance with speed and cost efficiency, namely GPT-4o-mini and Gemini 2.0 Flash. We use the DeepFashion-MultiModal dataset (https://github.com/yumingj/DeepFashion-MultiModal) to evaluate these models on fashion product attribution tasks. Our study evaluates the models across 18 categories of fashion attributes, offering insight into where they excel. To create a constrained environment, we use images as the sole source of product information. Our analysis shows that Gemini 2.0 Flash delivers the strongest overall performance, with a macro F1 score of 56.79% across all attributes, while GPT-4o-mini achieves a macro F1 score of 43.28%. Through detailed error analysis, our findings provide practical insights for deploying these LLMs in production e-commerce product attribution tasks and highlight the need for domain-specific fine-tuning approaches. This work also lays the groundwork for future research in fashion AI and multimodal attribute extraction.
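
For readers less familiar with the metric reported above, the following is a minimal sketch of how a macro F1 score can be computed. It is not the authors' evaluation code. It assumes that macro F1 is first computed per attribute as the unweighted mean of per-class F1 scores and then averaged across attributes; the abstract does not state the exact aggregation. The function names, attribute names, and labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_macro_f1(per_attribute):
    """Average macro F1 across attributes (assumed aggregation, see lead-in).

    per_attribute maps an attribute name to a (y_true, y_pred) pair.
    """
    scores = [macro_f1(t, p) for t, p in per_attribute.values()]
    return sum(scores) / len(scores)


# Hypothetical toy data, not taken from DeepFashion-MultiModal.
toy = {
    "sleeve_length": (["long", "long", "short", "short"],
                      ["long", "short", "short", "short"]),
    "neckline": (["v", "round"], ["v", "round"]),
}
print(f"{mean_macro_f1(toy):.4f}")
```

Because every class contributes equally regardless of how often it occurs, macro F1 penalizes a model that performs well only on frequent attribute values.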
