Abstract
Humans can seamlessly imitate others from visual observation, inferring their intentions and using past experience to achieve the same end goal. In other words, we can parse complex semantic knowledge from raw video and efficiently translate it into concrete motor control. Is it possible to give a robot this same capability? Prior research in robot imitation learning has created agents which can acquire diverse skills from expert human operators. However, extending these techniques to work with a single positive example at test time is still an open challenge. Apart from control, the difficulty stems from mismatches between the demonstrator and robot domains. For example, objects may be placed in different locations (e.g., kitchen layouts are different in every house). Additionally, the demonstration may come from an agent with a different morphology and physical appearance (e.g., a human), so one-to-one action correspondences are not available. This paper investigates techniques which allow robots to partially bridge these domain gaps using their past experience. A neural network is trained to mimic ground truth robot actions given context video from another agent, and must generalize to unseen task instances when prompted with new videos at test time. We hypothesize that our policy representations must be both context driven and dynamics aware in order to perform these tasks. These assumptions are built into the neural network using the Transformer attention mechanism and a self-supervised inverse dynamics loss. Finally, we show experimentally that our method achieves a $\sim 2$x improvement in task success rate over prior baselines in a suite of one-shot manipulation tasks.