Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of known objects and attributes by leveraging knowledge from previously seen compositions. Traditional approaches primarily focus on disentangling attributes and objects, treating them as independent entities during learning. However, this assumption overlooks the semantic constraints and contextual dependencies within a composition. For example, certain attributes naturally pair with specific objects (e.g., "striped" applies to "zebra" or "shirts" but not "sky" or "water"), while the same attribute can manifest differently depending on context (e.g., "young" in "young tree" vs. "young dog"). Thus, capturing attribute-object interdependence remains a fundamental yet long-ignored challenge in CZSL. In this paper, we adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies. We decompose the probability of a composition into two components: the likelihood of an object and the conditional likelihood of its attribute given that object. To enhance object feature learning, we incorporate textual descriptors to highlight semantically relevant image regions. These enhanced object features then guide attribute learning through a cross-attention mechanism, ensuring better contextual alignment. By jointly optimizing object likelihood and conditional attribute likelihood, our method effectively captures compositional dependencies and generalizes well to unseen compositions. Extensive experiments on multiple CZSL benchmarks demonstrate the superiority of our approach. Code is available here.
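
The decomposition described above can be written compactly. The following is a minimal sketch in assumed notation that does not appear in the abstract: x denotes the input image, o the object, and a the attribute. The first line is the chain rule applied to the composition probability. The second line is one plausible form of the joint objective, the negative log-likelihood of the composition, and is an assumption; the actual training loss may weight or extend these terms.

```latex
% Assumed notation (not from the source): x = image, o = object, a = attribute.
\begin{align}
  % Chain rule: object likelihood times conditional attribute likelihood.
  p(a, o \mid x) &= p(o \mid x)\, p(a \mid o, x), \\
  % Assumed joint objective: negative log-likelihood of the composition.
  \mathcal{L}(x, a, o) &= -\log p(o \mid x) - \log p(a \mid o, x).
\end{align}
```

By contrast, treating attributes and objects as independent would amount to replacing p(a | o, x) with p(a | x), which discards the dependence of the attribute on the object.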