Grounded Language Learning Fast and Slow

Abstract

Recent work has shown that large text-based neural language models, trained with conventional supervised learning objectives, acquire a surprising propensity for few- and one-shot learning. Here, we show that an embodied agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional reinforcement learning algorithms. After a single introduction to a novel object via continuous visual perception and a language prompt ("This is a dax"), the agent can re-identify the object and manipulate it as instructed ("Put the dax on the bed"). In doing so, it seamlessly integrates short-term, within-episode knowledge of the appropriate referent for the word "dax" with long-term lexical and motor knowledge acquired across episodes (i.e. "bed" and "putting"). We find that, under certain training conditions and with a particular memory writing mechanism, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects. We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful for later executing instructions. Together, the results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for 'fast-mapping', a fundamental pillar of human cognitive development and a potentially transformative capacity for agents that interact with human users.
