Tag2Text: Guiding Vision-Language Model via Image Tagging

Abstract

This paper presents Tag2Text, a vision language pre-training (VLP) framework,which introduces image tagging into vision-language models to guide thelearning of visual-linguistic features. In contrast to prior works whichutilize object tags either manually labeled or automatically detected with alimited detector, our approach utilizes tags parsed from its paired text tolearn an image tagger and meanwhile provides guidance to vision-languagemodels. Given that, Tag2Text can utilize large-scale annotation-free image tagsin accordance with image-text pairs, and provides more diverse tag categoriesbeyond objects. As a result, Tag2Text achieves a superior image tag recognitionability by exploiting fine-grained text information. Moreover, by leveragingtagging guidance, Tag2Text effectively enhances the performance ofvision-language models on both generation-based and alignment-based tasks.Across a wide range of downstream benchmarks, Tag2Text achievesstate-of-the-art or competitive results with similar model sizes and datascales, demonstrating the efficacy of the proposed tagging guidance.

Quick Read (beta)

loading the full paper ...