Abstract
We present a novel task and dataset for evaluating the ability of vision andlanguage models to conduct visio-linguistic compositional reasoning, which wecall Winoground. Given two images and two captions, the goal is to match themcorrectly - but crucially, both captions contain a completely identical set ofwords, only in a different order. The dataset was carefully hand-curated byexpert annotators and is labeled with a rich set of fine-grained tags to assistin analyzing model performance. We probe a diverse range of state-of-the-artvision and language models and find that, surprisingly, none of them do muchbetter than chance. Evidently, these models are not as skilled atvisio-linguistic compositional reasoning as we might have hoped. We perform anextensive analysis to obtain insights into how future work might try tomitigate these models' shortcomings. We aim for Winoground to serve as a usefulevaluation set for advancing the state of the art and driving further progressin the field. The dataset is available athttps://huggingface.co/datasets/facebook/winoground.