Abstract
Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure its impact on downstream policy performance. Robot videos are best viewed at https://dynamo-ssl.github.io
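To make the objective described above concrete, the following is a minimal sketch of a next-frame prediction loss in latent space. It is illustrative only and not the implementation used in this work. The linear encoder, the pairwise linear inverse and forward models, the mean squared error loss, the dimensions, and all names (`linear`, `random_matrix`, `dynamics_loss`) are assumptions made for this example. It uses plain Python lists so that it runs without dependencies; a practical version would use a deep-learning framework and train all three components jointly by gradient descent.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dynamics_loss(frames, enc_w, inv_w, fwd_w):
    """Mean squared error between predicted and actual next-frame embeddings.

    Assumption: MSE and linear models are chosen here for simplicity only.
    """
    # Encode every (flattened) frame into an embedding.
    z = [linear(enc_w, f) for f in frames]
    total, count = 0.0, 0
    for t in range(len(z) - 1):
        # Inverse dynamics: infer a latent action from two consecutive
        # embeddings (list "+" concatenates the two vectors).
        latent_action = linear(inv_w, z[t] + z[t + 1])
        # Forward dynamics: predict the next embedding from the current
        # embedding and the inferred latent action. No ground truth
        # actions are used anywhere.
        z_pred = linear(fwd_w, z[t] + latent_action)
        total += sum((p - q) ** 2 for p, q in zip(z_pred, z[t + 1]))
        count += len(z_pred)
    return total / count


# Hypothetical sizes: 5 frames of 8 "pixels", 4-d embeddings, 2-d latent actions.
rng = random.Random(0)
frame_dim, embed_dim, action_dim = 8, 4, 2
frames = [[rng.random() for _ in range(frame_dim)] for _ in range(5)]
enc_w = random_matrix(embed_dim, frame_dim, rng)
inv_w = random_matrix(action_dim, 2 * embed_dim, rng)
fwd_w = random_matrix(embed_dim, embed_dim + action_dim, rng)

loss = dynamics_loss(frames, enc_w, inv_w, fwd_w)
print(loss)
```

In this sketch the latent action is given fewer dimensions than the embedding, so the forward model cannot simply copy the next embedding through the inverse model. That bottleneck is a design choice of the example, not a detail stated in the abstract.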