Abstract
The development of large vision-language models (LVLMs) offers the potential to address challenges faced by traditional multimodal recommendation, thanks to their proficient understanding of static images and textual dynamics. However, the application of LVLMs in this field remains limited for two reasons. First, LVLMs lack user preference knowledge because they are trained on vast general-purpose datasets. Second, LVLMs struggle with multiple-image dynamics in scenarios involving discrete, noisy, and redundant image sequences. To overcome these issues, we propose a novel reasoning scheme named Rec-GPT4V: Visual-Summary Thought (VST) for leveraging large vision-language models in multimodal recommendation. To address the first challenge, we use the user history as in-context user preferences. To address the second, we prompt LVLMs to generate item image summaries, and then use this image comprehension in natural language space, combined with item titles, to query user preferences over candidate items. We conduct comprehensive experiments across four datasets with three LVLMs: GPT4-V, LLaVa-7b, and LLaVa-13b. The numerical results indicate the efficacy of VST.
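As an illustration only, the two steps described above (summarize each item image in natural language, then query preferences over candidates using titles plus summaries of the user history) might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Item` fields, the prompt wording, and the `lvlm` callable (any function that takes a text prompt and an optional image path and returns text) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Item:
    title: str
    image_path: str  # hypothetical field; the abstract does not say how images are stored


# Hypothetical model interface: (text prompt, optional image path) -> generated text.
LVLM = Callable[[str, Optional[str]], str]


def summarize_image(lvlm: LVLM, item: Item) -> str:
    """Ask the model for a short natural-language summary of one item image."""
    prompt = f"Summarize the image of the item titled '{item.title}' in one sentence."
    return lvlm(prompt, item.image_path).strip()


def build_vst_prompt(history: List[Item], candidates: List[Item], lvlm: LVLM) -> str:
    """Build a text-only query: user history (titles + image summaries) as in-context preferences."""
    lines = ["The user has interacted with the following items, in order:"]
    for i, item in enumerate(history, 1):
        summary = summarize_image(lvlm, item)
        lines.append(f"{i}. {item.title} | image summary: {summary}")
    lines.append("Candidate items:")
    for j, item in enumerate(candidates, 1):
        lines.append(f"{j}. {item.title}")
    lines.append("Rank the candidate items by how likely the user is to prefer them.")
    return "\n".join(lines)


def recommend(history: List[Item], candidates: List[Item], lvlm: LVLM) -> str:
    """Query the model with the assembled prompt; no image is passed at this stage."""
    return lvlm(build_vst_prompt(history, candidates, lvlm), None)
```

In this sketch each history image is passed to the model individually, so the final ranking query is text only and the model never has to reason over a long image sequence at once.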