PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

Abstract

While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works explicitly increase the granularity of visual information processing, e.g., by incorporating visual prompts that guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve a model's fine-grained vision-language alignment. However, the inherent token length limitation of CLIP's text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To leverage the advantages of finer-grained processing of both visual and textual content, we propose PixCLIP, a novel framework designed to accommodate visual prompt inputs and process lengthy textual descriptions at the same time. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Using this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples. Second, we replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework, which facilitates fine-grained alignment between image regions and their corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP delivers breakthroughs in pixel-level interaction and in handling long-form texts, achieving state-of-the-art performance.
