Abstract
The inherent ambiguity in defining visual concepts poses significantchallenges for modern generative models, such as the diffusion-basedText-to-Image (T2I) models, in accurately learning concepts from a singleimage. Existing methods lack a systematic way to reliably extract theinterpretable underlying intrinsic concepts. To address this challenge, wepresent ICE, short for Intrinsic Concept Extraction, a novel framework thatexclusively utilises a T2I model to automatically and systematically extractintrinsic concepts from a single image. ICE consists of two pivotal stages. Inthe first stage, ICE devises an automatic concept localization module topinpoint relevant text-based concepts and their corresponding masks within theimage. This critical stage streamlines concept initialization and providesprecise guidance for subsequent analysis. The second stage delves deeper intoeach identified mask, decomposing the object-level concepts into intrinsicconcepts and general concepts. This decomposition allows for a more granularand interpretable breakdown of visual elements. Our framework demonstratessuperior performance on intrinsic concept extraction from a single image in anunsupervised manner. Project page: https://visual-ai.github.io/ice