Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling

Abstract

Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive, but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next-token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
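
The combination of objectives described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the form of the contrastive loss, the temperature, the weighting between the two terms, or how early-layer text representations are pooled. The sketch assumes a symmetric CLIP-style InfoNCE loss, and the names `next_token_loss`, `contrastive_loss`, `combined_loss`, `temperature`, and `weight` are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def next_token_loss(logits, targets):
    """Mean cross-entropy; logits[t] holds vocabulary scores at position t."""
    total = sum(log_softmax(row)[y] for row, y in zip(logits, targets))
    return -total / len(targets)


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE (an assumption): pair i of text/image is the positive."""
    t = [normalize(v) for v in text_embs]
    im = [normalize(v) for v in image_embs]
    # Cosine similarity between every text and every image, scaled by temperature.
    sims = [[sum(a * b for a, b in zip(ti, ij)) / temperature for ij in im]
            for ti in t]
    n = len(sims)
    text_to_image = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    image_to_text = -sum(
        log_softmax([sims[j][i] for j in range(n)])[i] for i in range(n)
    ) / n
    return 0.5 * (text_to_image + image_to_text)


def combined_loss(logits, targets, early_text_embs, image_embs, weight=1.0):
    """Language-modeling loss plus a weighted grounding term.

    early_text_embs stands in for early-layer (lexical) text representations;
    the weight of 1.0 is a placeholder, not a value from the paper.
    """
    lm = next_token_loss(logits, targets)
    grounding = contrastive_loss(early_text_embs, image_embs)
    return lm + weight * grounding
```

In this sketch the grounding term is small when each text embedding is most similar to its own paired image embedding and large when the pairs are mismatched, so minimizing the sum pushes the early-layer text representations toward their visual counterparts while the language-modeling term is trained as usual.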
