Learning to Model the World with Language

Abstract

To interact with humans in the world, agents need to understand the diverse types of language that people use, relate them to the visual world, and act based on them. While current agents learn to execute simple language instructions from task rewards, we aim to build agents that leverage diverse language that conveys general knowledge, describes the state of the world, provides interactive feedback, and more. Our key idea is that language helps agents predict the future: what will be observed, how the world will behave, and which situations will be rewarded. This perspective unifies language understanding with future prediction as a powerful self-supervised learning objective. We present Dynalang, an agent that learns a multimodal world model that predicts future text and image representations and learns to act from imagined model rollouts. Unlike traditional agents that use language only to predict actions, Dynalang acquires rich language understanding by using past language also to predict future language, video, and rewards. In addition to learning from online interaction in an environment, Dynalang can be pretrained on datasets of text, video, or both without actions or rewards. From using language hints in grid worlds to navigating photorealistic scans of homes, Dynalang utilizes diverse types of language to improve task performance, including environment descriptions, game rules, and instructions.
