Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier

Abstract

With the advent of large pre-trained vision-language models such as CLIP, prompt learning methods aim to enhance the transferability of the CLIP model. They learn the prompt from a few samples of the downstream task, using the specific class names as prior knowledge, which we term semantic-aware classification. However, in many realistic scenarios, we only have access to a few samples and no knowledge of the class names (e.g., when considering instances of classes). This challenging scenario represents the semantic-agnostic discriminative case. Text-to-Image (T2I) personalization methods aim to adapt T2I models to unseen concepts by learning new tokens and endowing these tokens with the capability of generating the learned concepts. These methods do not require knowledge of class names as a semantic-aware prior. Therefore, in this paper, we first explore Textual Inversion and reveal that the new concept tokens possess both generation and classification capabilities when each category is regarded as a single concept. However, learning classifiers from single-concept textual inversion is limited, since the learned tokens are suboptimal for discriminative tasks. To mitigate this issue, we propose Multi-Class Textual Inversion (MC-TI), which includes a discriminative regularization term in the token updating process. Using this technique, MC-TI achieves stronger semantic-agnostic classification while preserving the generation capability of these modifier tokens given only a few samples per category. In the experiments, we extensively evaluate MC-TI on 12 datasets covering various scenarios and show that MC-TI achieves superior results in both classification and generation.
