The Surprising Effectiveness of Representation Learning for Visual Imitation

Abstract

While visual imitation learning offers one of the most effective ways of learning from visual demonstrations, generalizing from them requires either hundreds of diverse demonstrations, task-specific priors, or large, hard-to-train parametric models. One reason such complexities arise is that standard visual imitation frameworks try to solve two coupled problems at once: learning a succinct but good representation from the diverse visual data, while simultaneously learning to associate the demonstrated actions with such representations. Such joint learning causes an interdependence between these two problems, which often results in needing large numbers of demonstrations for learning. To address this challenge, we instead propose to decouple representation learning from behavior learning for visual imitation. First, we learn a visual representation encoder from offline data using standard supervised and self-supervised learning methods. Once the representations are trained, we use non-parametric Locally Weighted Regression to predict the actions. We experimentally show that this simple decoupling improves the performance of visual imitation models on both offline demonstration datasets and real-robot door opening compared to prior work in visual imitation. All of our generated data, code, and robot videos are publicly available at https://jyopari.github.io/VINN/.
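
To make the second stage concrete, the following is a minimal sketch of non-parametric action prediction over frozen embeddings. It is an illustration under stated assumptions, not the released implementation: the function name `locally_weighted_action`, the choice of Euclidean distance, the softmax-of-negative-distance weighting, and the neighborhood size `k` are all hypothetical choices made here. The sketch uses the simplest (zeroth-order) form of locally weighted regression, a weighted average of the actions of the k nearest demonstration frames. The embeddings are assumed to come from an already trained encoder, which is not shown.

```python
import numpy as np


def locally_weighted_action(query, embeddings, actions, k=5):
    """Predict an action as a distance-weighted average of the
    actions attached to the k nearest demonstration embeddings.

    query:      (d,)   embedding of the current observation
    embeddings: (n, d) embeddings of the demonstration frames
    actions:    (n, a) actions recorded with those frames
    k:          number of neighbors to average (assumed hyperparameter)
    """
    query = np.asarray(query, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    actions = np.asarray(actions, dtype=float)

    k = min(k, len(embeddings))

    # Euclidean distance from the query to every demonstration embedding.
    dists = np.linalg.norm(embeddings - query, axis=1)

    # Indices of the k closest demonstrations.
    nearest = np.argsort(dists)[:k]

    # Softmax over negative distances: closer neighbors get larger weights.
    # Subtracting the max logit keeps the exponentials numerically stable.
    logits = -dists[nearest]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Weighted average of the neighbors' actions.
    return weights @ actions[nearest]


# Toy usage with made-up 2-D embeddings and 2-D actions.
demo_embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
demo_actions = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
predicted = locally_weighted_action([0.5, 0.0], demo_embeddings, demo_actions, k=2)
```

Because the predictor has no trainable parameters, new demonstrations can be added by appending rows to the embedding and action arrays, without retraining anything.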
