Abstract
We explore using latent natural language instructions as an expressive and compositional representation of complex actions for hierarchical decision making. Rather than directly selecting micro-actions, our agent first generates a latent plan in natural language, which is then executed by a separate model. We introduce a challenging real-time strategy game environment in which the actions of a large number of units must be coordinated across long time scales. We gather a dataset of 76 thousand pairs of instructions and executions from human play, and train instructor and executor models. Experiments show that models using natural language as a latent variable significantly outperform models that directly imitate human actions. The compositional structure of language proves crucial to its effectiveness for action representation. We also release our code, models and data.