Learning to Model the World with Language

Abstract

To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world. While current agents can learn to execute simple language instructions, we aim to build agents that leverage diverse language -- language like "this button turns on the TV" or "I put the bowls away" -- that conveys general knowledge, describes the state of the world, provides interactive feedback, and more. Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future: what they will observe, how the world will behave, and which situations will be rewarded. This perspective unifies language understanding with future prediction as a powerful self-supervised learning objective. We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations, and learns to act from imagined model rollouts. While current methods that learn language-conditioned policies degrade in performance with more diverse types of language, we show that Dynalang learns to leverage environment descriptions, game rules, and instructions to excel on tasks ranging from game-playing to navigating photorealistic home scans. Finally, we show that our method enables additional capabilities due to learning a generative model: Dynalang can be pretrained on text-only data, enabling learning from offline datasets, and generate language grounded in an environment.
