SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark

Abstract

Existing work in language grounding typically studies single environments. How do we build unified models that apply across multiple environments? We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG), which unifies a collection of diverse grounded language learning environments under a common interface. SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown). Together, these environments provide diverse grounding challenges in richness of observation space, action space, language specification, and plan complexity. In addition, we propose the first shared model architecture for RL on these environments, and evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LM using SILG. Our shared architecture achieves comparable performance to environment-specific architectures. Moreover, we find that many recent modelling advances do not result in significant gains on environments other than the one they were designed for. This highlights the need for a multi-environment benchmark. Finally, the best models significantly underperform humans on SILG, which suggests ample room for future work. We hope SILG enables the community to quickly identify new methodologies for language grounding that generalize to a diverse set of environments and their associated challenges.
