Abstract
The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent are contextual representations of concrete nouns aligned with corresponding visual representations? We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.