Abstract
The success of large-scale contrastive vision-language pretraining (CLIP) has benefited both visual recognition and multimodal content understanding. Its concise design gives CLIP an advantage in inference efficiency over other vision-language models with heavier cross-attention fusion layers, making it a popular choice for a wide spectrum of downstream tasks. However, CLIP does not explicitly capture the hierarchical nature of the high-level and fine-grained semantics conveyed in images and texts, which is arguably critical to vision-language understanding and reasoning. To this end, we equip both the visual and language branches in CLIP with hierarchy-aware attention, yielding Hierarchy-aware CLIP (HiCLIP), which progressively discovers semantic hierarchies layer by layer from both images and texts in an unsupervised manner. As a result, such hierarchical aggregation significantly improves the cross-modal alignment. To demonstrate the advantages of HiCLIP, we conduct a qualitative analysis of its unsupervised hierarchy induction during inference, as well as extensive quantitative experiments on both visual recognition and vision-language downstream tasks.