LanGWM: Language Grounded World Model

Abstract

Recent advances in deep reinforcement learning have showcased its potential in tackling complex tasks. However, experiments on visual control tasks have revealed that state-of-the-art reinforcement learning models struggle with out-of-distribution generalization. Conversely, expressing higher-level concepts and global contexts is relatively easy using language. Building upon the recent success of large language models, our main objective is to improve the state abstraction technique in reinforcement learning by leveraging language for robust action selection. Specifically, we focus on learning language-grounded visual features to enhance world model learning, a model-based reinforcement learning technique. To enforce our hypothesis explicitly, we mask out the bounding boxes of a few objects in the image observation and provide text prompts as descriptions of these masked objects. Subsequently, we predict the masked objects along with the surrounding regions as pixel reconstruction, similar to the transformer-based masked autoencoder approach. Our proposed LanGWM: Language Grounded World Model achieves state-of-the-art performance on out-of-distribution tests at the 100K interaction steps benchmark of iGibson point navigation tasks. Furthermore, our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction because our extracted visual features are language grounded.
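
The masking step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the names `MaskedObject`, `mask_objects`, and `reconstruction_loss` are hypothetical, the image is a small grayscale grid standing in for an RGB observation, the zero fill value is an assumption, and mean squared error is assumed as the pixel reconstruction objective. The encoder and decoder that would consume the masked image and the text prompts are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical stand-in for an image observation: a grayscale H x W grid.
Image = List[List[float]]


@dataclass
class MaskedObject:
    box: Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive
    description: str                # text prompt describing the masked object


def mask_objects(image: Image, objects: List[MaskedObject], fill: float = 0.0):
    """Hide each object's bounding box and collect its text description.

    Returns (masked_image, mask, prompts), where mask[r][c] is True for
    every pixel that was hidden. The input image is left unmodified.
    """
    height, width = len(image), len(image[0])
    masked = [row[:] for row in image]
    mask = [[False] * width for _ in range(height)]
    for obj in objects:
        top, left, bottom, right = obj.box
        # Clip the box to the image bounds before filling.
        for r in range(max(top, 0), min(bottom, height)):
            for c in range(max(left, 0), min(right, width)):
                masked[r][c] = fill
                mask[r][c] = True
    prompts = [obj.description for obj in objects]
    return masked, mask, prompts


def reconstruction_loss(prediction: Image, target: Image) -> float:
    """Mean squared pixel error over the full image (assumed objective).

    The loss covers both the masked objects and the surrounding regions.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(prediction, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count
```

In this sketch a model would receive `masked` together with `prompts` and be trained so that its predicted image minimizes `reconstruction_loss` against the original observation, which is only recoverable inside the masked boxes if the text descriptions are used.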
