Natural Language Inference Improves Compositionality in Vision-Language Models

Abstract

Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of the textual description, using Large Language Models (LLMs) to break it down into subsets of questions and answers. However, these methods primarily operate on the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages Natural Language Inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show that CECE enhances interpretability and reduces overreliance on biased or superficial features. By balancing CECE along with the original premise, we achieve significant improvements over previous methods without requiring additional fine-tuning, producing state-of-the-art results on benchmarks that score agreement with human judgments for image-text alignment, and achieving an increase in performance on Winoground of +19.2% (group score) and +12.9% on EqBen (group score) over the best prior work (finetuned with targeted data).
