"This is my unicorn, Fluffy": Personalizing frozen vision-language representations

Abstract

Large Vision & Language models pretrained on web-scale data provide representations that are invaluable for numerous V&L problems. However, it is unclear how they can be used for reasoning about user-specific visual concepts in unstructured language. This problem arises in multiple domains, from personalized image retrieval to personalized interaction with smart devices. We introduce a new learning setup called Personalized Vision & Language (PerVL) with two new benchmark datasets for retrieving and segmenting user-specific "personalized" concepts "in the wild". In PerVL, one should learn personalized concepts (1) independently of the downstream task, (2) in a way that allows a pretrained model to reason about them with free language, and (3) without requiring personalized negative examples. We propose an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts. The model can then reason about them by simply using them in a sentence. We demonstrate that our approach learns personalized visual concepts from a few examples and can effectively apply them in image retrieval and semantic segmentation using rich textual queries.
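To make the vocabulary-extension idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the tiny vocabulary, the `<fluffy>` token, the `extend_vocabulary` and `embed_sentence` helpers, and the choice to initialise the new embedding from an existing word are all assumptions made for illustration. It only shows that adding one new row to a frozen embedding table is enough for the new concept to appear inside an ordinary sentence.

```python
import random

# Toy stand-in for a frozen pretrained vocabulary: token -> embedding vector.
# All tokens and values are hypothetical, for illustration only.
DIM = 4
random.seed(0)
frozen_vocab = {
    tok: [random.gauss(0.0, 1.0) for _ in range(DIM)]
    for tok in ["a", "photo", "of", "on", "the", "beach", "toy"]
}


def extend_vocabulary(vocab, new_token, init_from):
    """Return a copy of vocab with one extra row for a personalized concept.

    Assumption: the new embedding starts as a copy of an existing word's
    embedding. In a real system it would then be optimised on a few example
    images while every pretrained row stays frozen.
    """
    if new_token in vocab:
        raise ValueError(f"{new_token!r} is already in the vocabulary")
    extended = dict(vocab)
    extended[new_token] = list(vocab[init_from])  # copy, so the frozen row is untouched
    return extended


def embed_sentence(vocab, sentence):
    """Look up each whitespace-separated token and return its vectors."""
    return [vocab[tok] for tok in sentence.split()]


vocab = extend_vocabulary(frozen_vocab, "<fluffy>", init_from="toy")
vectors = embed_sentence(vocab, "a photo of <fluffy> on the beach")
print(len(vectors), len(vectors[0]))
```

In this sketch the pretrained table is never modified; only the single new row would be trainable, which matches the abstract's description of working with frozen representations.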
