StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Abstract

We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. We consider specifically Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real-image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
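
To make the multi-positive idea in point (2) concrete, the following is a minimal sketch, not the authors' implementation. It assumes the loss is a cross-entropy between a softmax over embedding similarities and a target distribution that is uniform over the other images generated from the same prompt. The function names, the temperature value of 0.1, and the pure-Python formulation are illustrative assumptions; the abstract does not specify any of them.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """Hypothetical multi-positive contrastive loss (illustrative sketch).

    embeddings: list of feature vectors, one per image.
    prompt_ids: prompt_ids[i] identifies the text prompt that produced image i;
                images sharing an id are treated as positives for each other.
    temperature: assumed value, not taken from the source.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total = 0.0
    for i in range(n):
        # Every other image in the batch is a candidate; the anchor is excluded.
        others = [j for j in range(n) if j != i]
        logits = [
            sum(a * b for a, b in zip(z[i], z[j])) / temperature for j in others
        ]
        pos_logits = [
            logit
            for j, logit in zip(others, logits)
            if prompt_ids[j] == prompt_ids[i]
        ]
        if not pos_logits:
            raise ValueError("each prompt needs at least two images in the batch")
        # Numerically stable log of the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy against a target that is uniform over the positives.
        total += -sum(l - log_denom for l in pos_logits) / len(pos_logits)
    return total / n


if __name__ == "__main__":
    # Two prompts, two images each; images from the same prompt are close.
    batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(multi_positive_contrastive_loss(batch, [0, 0, 1, 1]))
```

Under these assumptions, the loss is small when images generated from the same prompt lie close together in embedding space and large when they do not, which is the behavior the grouping by prompt is meant to encourage.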
