A Simple Long-Tailed Recognition Baseline via Vision-Language Model

Abstract

The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems. Existing approaches either apply class re-balancing strategies or directly improve network modules to address the problem. However, they still train models with a finite set of predefined labels, limiting their supervision information and restricting their transferability to novel instances. Recent advances in large-scale contrastive vision-language pretraining shed light on a new pathway for visual recognition. With open-vocabulary supervision, pretrained contrastive vision-language models learn powerful multimodal representations that are promising for handling data deficiency and unseen concepts. By calculating the semantic similarity between visual and text inputs, visual recognition is converted to a vision-language matching problem. Inspired by this, we propose BALLAD to leverage contrastive vision-language models for long-tailed recognition. We first continue pretraining the vision-language backbone through contrastive learning on a specific long-tailed target dataset. Afterward, we freeze the backbone and further employ an additional adapter layer to enhance the representations of tail classes on balanced training samples built with re-sampling strategies. Extensive experiments have been conducted on three popular long-tailed recognition benchmarks. As a result, our simple and effective approach sets new state-of-the-art performance and outperforms competitive baselines by a large margin. Code is released at https://github.com/gaopengcuhk/BALLAD.
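
To make the vision-language matching idea concrete, the following is a minimal sketch, not the BALLAD implementation. It assumes that an image and each class prompt have already been mapped to embedding vectors by some pretrained image and text encoders; the toy vectors, the prompt strings, the function names, and the temperature value are all hypothetical and chosen only for illustration. Classification then reduces to picking the class prompt whose embedding has the highest cosine similarity with the image embedding.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def match_image_to_class(image_embedding, text_embeddings, temperature=0.01):
    """Return the best-matching prompt and a softmax over all prompts.

    `temperature` is an illustrative assumption; it only sharpens the softmax.
    """
    names = list(text_embeddings)
    logits = [
        cosine_similarity(image_embedding, text_embeddings[name]) / temperature
        for name in names
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(names, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy stand-ins for encoder outputs (hypothetical values, not real features).
text_embeddings = {
    "a photo of a dog": [1.0, 0.1, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.1],
    "a photo of a snow leopard": [0.1, 0.0, 1.0],
}
image_embedding = [0.2, 0.1, 0.9]

predicted, probabilities = match_image_to_class(image_embedding, text_embeddings)
print(predicted)
```

Because classes enter only through their text prompts, adding a class in this sketch means adding one more entry to `text_embeddings`, with no change to a fixed-size classifier head.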
