Abstract
In recent years, several machine learning models have been proposed that are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by "listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of 'language games' to categorise such tasks into three families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should focus on interactive games, where communication in Natural Language is important to resolve ambiguities about object referents and action plans, and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for developing grounded meanings in neural models.