Abstract
We demonstrate how a sampling-based robotic planner can be augmented to learn to understand a sequence of natural language commands in a continuous configuration space to move and manipulate objects. Our approach combines a deep network structured according to the parse of a complex command that includes objects, verbs, spatial relations, and attributes, with a sampling-based planner, RRT. A recurrent hierarchical deep network controls how the planner explores the environment, determines when a planned path is likely to achieve a goal, and estimates the confidence of each move to trade off exploitation and exploration between the network and the planner. Planners are designed to have near-optimal behavior when information about the task is missing, while networks learn to exploit observations that are available from the environment, making the two naturally complementary. Combining the two enables generalization to new maps, new kinds of obstacles, and more complex sentences that do not occur in the training set. Little data is required to train the model despite it jointly acquiring a CNN that extracts features from the environment as it learns the meanings of words. The model provides a level of interpretability through the use of attention maps, allowing users to see its reasoning steps despite being an end-to-end model. This end-to-end model allows robots to learn to follow natural language commands in challenging continuous environments.
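
To make the exploitation and exploration trade-off concrete, the following is a minimal sketch of one plausible reading, not the model described above. It assumes that at each step a learned policy returns a proposed target and a confidence in [0, 1], and that the planner follows the proposal with that probability and otherwise falls back to uniform RRT sampling. The names `guided_rrt`, `sample_target`, `policy`, and `is_free` are hypothetical, the policy is a stub rather than a trained network, and a simple distance threshold stands in for the network's learned judgment of when a path is likely to achieve the goal.

```python
import math
import random


def sample_target(proposal, confidence, bounds, rng):
    """Follow the policy's proposal with probability `confidence`
    (exploitation); otherwise sample uniformly (exploration)."""
    if rng.random() < confidence:
        return proposal
    return tuple(rng.uniform(lo, hi) for lo, hi in bounds)


def nearest(tree, point):
    """Return the tree node closest to `point` (linear scan)."""
    return min(tree, key=lambda node: math.dist(node, point))


def steer(src, dst, step):
    """Move from `src` toward `dst` by at most `step`."""
    d = math.dist(src, dst)
    if d <= step:
        return dst
    t = step / d
    return tuple(s + t * (e - s) for s, e in zip(src, dst))


def guided_rrt(start, goal, bounds, policy, is_free=lambda p: True,
               step=0.5, goal_tol=0.5, max_iters=2000, seed=0):
    """Grow an RRT whose sampling is biased by a (hypothetical) policy.

    `policy(tree)` is assumed to return (proposed_target, confidence).
    `is_free(point)` is an assumed collision check; the default accepts
    every point. Returns a list of points from start to near the goal,
    or None if no path is found within `max_iters`.
    """
    rng = random.Random(seed)
    parent = {start: None}  # maps each node to its parent in the tree
    for _ in range(max_iters):
        proposal, confidence = policy(parent)
        target = sample_target(proposal, confidence, bounds, rng)
        near = nearest(parent, target)
        new = steer(near, target, step)
        if new in parent or not is_free(new):
            continue
        parent[new] = near
        # Stand-in for a learned "goal achieved" signal.
        if math.dist(new, goal) <= goal_tol:
            path, node = [], new
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
    return None


if __name__ == "__main__":
    goal = (5.0, 5.0)
    # Stub policy: always proposes the goal, with 50% confidence.
    path = guided_rrt((0.0, 0.0), goal, [(0.0, 10.0), (0.0, 10.0)],
                      policy=lambda tree: (goal, 0.5))
    print(len(path), "waypoints, ending at", path[-1])
```

With confidence fixed at 0 the sketch reduces to plain uniform-sampling RRT, and with confidence fixed at 1 it follows the policy greedily; intermediate values interpolate between the two behaviors.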