Iterated Learning Improves Compositionality in Large Vision-Language Models

Abstract

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, recent investigations find that most, if not all, of our state-of-the-art vision-language models struggle at compositionality. They are unable to distinguish between images of "a girl in white facing a man in black" and "a girl in black facing a man in white". Moreover, prior work suggests that compositionality doesn't arise with scale: larger model sizes or training data don't help. This paper develops a new iterated training algorithm that incentivizes compositionality. We draw on decades of cognitive science research that identifies cultural transmission (the need to teach a new generation) as a necessary inductive prior that incentivizes humans to develop compositional languages. Specifically, we reframe vision-language contrastive learning as the Lewis Signaling Game between a vision agent and a language agent, and operationalize cultural transmission by iteratively resetting one of the agents' weights during training. After every iteration, this training paradigm induces representations that become "easier to learn", a property of compositional languages. For example, our models trained on CC3M and CC12M improve on standard CLIP by 4.7% and 4.0%, respectively, on the SugarCrepe benchmark.
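
The iterated training loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each agent is a single linear layer, it assumes the language agent is the one whose weights are reset (the abstract only says "one of the agents"), and it uses random-search hill climbing as a stand-in for gradient descent so the example needs only the standard library. All function and parameter names are hypothetical.

```python
import math
import random


def init_weights(rows, cols, rng):
    """Fresh random weights for one agent (a single linear layer in this toy)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def encode(weights, x):
    """Linear projection followed by L2 normalization."""
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    norm = math.sqrt(sum(v * v for v in out)) or 1.0
    return [v / norm for v in out]


def contrastive_loss(vision_w, text_w, images, texts, temperature=0.1):
    """Symmetric InfoNCE loss over matched (image, text) pairs."""
    img = [encode(vision_w, x) for x in images]
    txt = [encode(text_w, t) for t in texts]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]

    def cross_entropy(rows):
        # The correct match for row k is column k.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            total += m + math.log(sum(math.exp(v - m) for v in row)) - row[k]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


def perturb(weights, rng, scale=0.05):
    """Propose a small random change (stand-in for a gradient step)."""
    return [[w + rng.gauss(0.0, scale) for w in row] for row in weights]


def iterated_training(images, texts, dim=4, generations=3,
                      steps_per_generation=50, seed=0):
    """Train both agents; reset the language agent at each new generation."""
    rng = random.Random(seed)
    vision_w = init_weights(dim, len(images[0]), rng)
    text_w = init_weights(dim, len(texts[0]), rng)
    history = []
    for gen in range(generations):
        if gen > 0:
            # Cultural transmission (assumed form): a new language agent
            # starts from scratch, while the vision agent keeps its weights.
            text_w = init_weights(dim, len(texts[0]), rng)
        loss = contrastive_loss(vision_w, text_w, images, texts)
        start_loss = loss
        for _ in range(steps_per_generation):
            cand_v = perturb(vision_w, rng)
            cand_t = perturb(text_w, rng)
            cand_loss = contrastive_loss(cand_v, cand_t, images, texts)
            if cand_loss < loss:  # keep only improving proposals
                vision_w, text_w, loss = cand_v, cand_t, cand_loss
        history.append((gen, start_loss, loss))
    return history
```

Calling `iterated_training(images, texts)` with small lists of matched feature vectors returns one `(generation, start_loss, end_loss)` tuple per generation. In a real setting the two encoders would be CLIP-style networks trained with a gradient-based optimizer, and the reset schedule would be a tuned hyperparameter.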
