Abstract
Vision-language models (VLMs) like CLIP excel at recognizing the single, prominent object in a scene. However, they struggle in complex scenes containing multiple objects. We identify a fundamental reason behind this limitation: the feature space of VLMs exhibits significant semantic entanglement, where the features of one class contain substantial information about other, unrelated classes, a phenomenon we term mutual feature information (MFI). This entanglement becomes evident during class-specific queries, when unrelated objects are activated alongside the queried class. To address this limitation, we propose DCLIP, a framework that disentangles CLIP features using two complementary objectives: a novel MFI Loss that orthogonalizes the text (class) features to reduce inter-class similarity, and the Asymmetric Loss (ASL) that aligns image features with the disentangled text features. Our experiments demonstrate that DCLIP reduces inter-class feature similarity by 30\% compared to CLIP, leading to significant performance gains on multi-label recognition (MLR) and zero-shot semantic segmentation (ZS3). In MLR, DCLIP outperforms state-of-the-art (SOTA) approaches on VOC2007 and COCO-14 while using 75\% fewer parameters, and in ZS3 it surpasses SOTA methods by 3.4 mIoU on VOC2012 and 2.8 mIoU on COCO-17. These results establish feature disentanglement as a critical factor for effective multi-object perception in vision-language models.
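
To make the notion of inter-class feature similarity concrete, the following is a minimal sketch and not the MFI Loss as defined in this work. It assumes that entanglement can be approximated by the off-diagonal entries of the cosine-similarity matrix of the class text features, and that an orthogonality-style penalty can be written as their mean square. The function names and the toy feature vectors are hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(vec):
    # Scale a nonzero vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity_matrix(features):
    # features: one vector per class (stand-ins for text embeddings).
    unit = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def mean_offdiag_similarity(features):
    # Average absolute cosine similarity between distinct classes.
    sim = cosine_similarity_matrix(features)
    n = len(sim)
    total = sum(abs(sim[i][j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def orthogonality_penalty(features):
    # Assumed illustrative penalty: mean squared off-diagonal similarity.
    # It is zero exactly when all class features are mutually orthogonal.
    sim = cosine_similarity_matrix(features)
    n = len(sim)
    total = sum(sim[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# Toy data (hypothetical): the first two classes overlap strongly.
entangled = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.1, 1.0]]
disentangled = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(mean_offdiag_similarity(entangled), orthogonality_penalty(entangled))
print(mean_offdiag_similarity(disentangled), orthogonality_penalty(disentangled))
```

Under these assumptions, the penalty vanishes for the mutually orthogonal set and is positive for the overlapping one, so minimizing it with respect to the text features would push classes apart in the sense described above.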