Seeing the Abstract: Translating the Abstract Language for Vision Language Models

Abstract

Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly-representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a plug-and-play solution.
