Abstract
Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive, but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next-token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
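
To make the combined objective concrete, the following is a minimal NumPy sketch of a next-token loss added to a contrastive grounding loss computed on early-layer text embeddings. It is an illustration under stated assumptions, not the training code used in this work: the symmetric InfoNCE form, the `temperature` value, the `weight` coefficient, and all function names are hypothetical choices made for the example.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def next_token_loss(logits, targets):
    # Mean cross-entropy. logits: (T, V); targets: (T,) integer token ids.
    logp = log_softmax(logits, axis=-1)
    return -logp[np.arange(len(targets)), targets].mean()

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    # Assumed form: symmetric InfoNCE over B matched (text, image) pairs.
    # text_emb, image_emb: (B, D); row i of each is a matched pair.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = t @ v.T / temperature          # (B, B) scaled cosine similarities
    idx = np.arange(sim.shape[0])
    text_to_image = -log_softmax(sim, axis=1)[idx, idx].mean()
    image_to_text = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (text_to_image + image_to_text)

def combined_loss(logits, targets, early_text_emb, image_emb, weight=1.0):
    # Hypothetical weighting: language modeling loss plus a weighted
    # grounding loss on early-layer (lexical) text embeddings.
    lm = next_token_loss(logits, targets)
    grounding = contrastive_loss(early_text_emb, image_emb)
    return lm + weight * grounding
```

In this sketch the grounding term only sees `early_text_emb`, which mirrors the focus on early-layer lexical representations, while the language modeling term is computed from the model's output logits as usual.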