Bridging the gap to real-world language-grounded visual concept learning

Abstract

Human intelligence effortlessly interprets visual scenes along a richspectrum of semantic dimensions. However, existing approaches tolanguage-grounded visual concept learning are limited to a few predefinedprimitive axes, such as color and shape, and are typically explored insynthetic datasets. In this work, we propose a scalable framework thatadaptively identifies image-related concept axes and grounds visual conceptsalong these axes in real-world scenes. Leveraging a pretrained vision-languagemodel and our universal prompting strategy, our framework identifies a diverseimage-related axes without any prior knowledge. Our universal concept encoderadaptively binds visual features to the discovered axes without introducingadditional model parameters for each concept. To ground visual concepts alongthe discovered axes, we optimize a compositional anchoring objective, whichensures that each axis can be independently manipulated without affectingothers. We demonstrate the effectiveness of our framework on subsets ofImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities acrossdiverse real-world concepts that are too varied to be manually predefined. Ourmethod also exhibits strong compositional generalization, outperformingexisting visual concept learning and text-based editing methods. The code isavailable at https://github.com/whieya/Language-grounded-VCL.

Quick Read (beta)

loading the full paper ...