Classification Done Right for Vision-Language Pre-Training

Abstract

We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts images against the output of a text encoder, SuperClass directly uses tokenized raw text as supervised classification labels, without the need for additional text filtering or selection. Because no text encoding serves as a contrastive target, SuperClass does not require a text encoder and does not need to maintain a large batch size as CLIP does. SuperClass demonstrated superior performance on various downstream tasks, including classic computer vision benchmarks and vision-language downstream tasks. We further explored the scaling behavior of SuperClass with respect to model size, training length, and data size, and reported encouraging results and comparisons to CLIP. Code: https://github.com/x-cls/superclass
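To make the idea of "tokenized raw text as classification labels" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the tokenizer, the loss, or any label weighting, so the token IDs, the vocabulary size, the function names `multi_hot_labels` and `softmax_classification_loss`, and the choice of a softmax cross-entropy against a normalized multi-hot target are all hypothetical. The logits stand in for the output of an image encoder with a classification head over the subword vocabulary.

```python
import math


def multi_hot_labels(token_ids, vocab_size):
    """Turn a caption's subword token IDs into a multi-hot target over the vocabulary.

    Repeated tokens count once; word order is discarded.
    """
    labels = [0.0] * vocab_size
    for t in token_ids:
        if not 0 <= t < vocab_size:
            raise ValueError(f"token id {t} outside vocabulary of size {vocab_size}")
        labels[t] = 1.0
    return labels


def softmax_classification_loss(logits, labels):
    """Cross-entropy between softmax(logits) and the normalized multi-hot target.

    Assumption: this particular loss is chosen for illustration only.
    """
    total = sum(labels)
    if total == 0:
        return 0.0  # caption with no tokens contributes nothing
    # Numerically stable log-sum-exp for the softmax normalizer.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum((y / total) * (x - log_z) for x, y in zip(logits, labels))


# Hypothetical example: a caption tokenized to IDs [2, 5, 5, 7], vocabulary of 8 subwords.
vocab_size = 8
labels = multi_hot_labels([2, 5, 5, 7], vocab_size)

# Stand-in for image-encoder logits; an untrained model with uniform outputs.
logits = [0.0] * vocab_size
loss = softmax_classification_loss(logits, labels)
```

Note that each image's loss depends only on that image's logits and its own caption tokens. No other samples in the batch act as negatives, which is consistent with the abstract's point that a large batch size is not required.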
