Abstract
We present LangToMo, a vision-language-action framework structured as a dual-system architecture that uses pixel motion forecasts as intermediate representations. Our high-level System 2, an image diffusion model, generates text-conditioned pixel motion sequences from a single frame to guide robot control. Pixel motion (a universal, interpretable, and motion-centric representation) can be extracted from videos in a self-supervised manner, enabling diffusion model training on web-scale video-caption data. Treating generated pixel motion as learned universal representations, our low-level System 1 module translates these into robot actions via motion-to-action mapping functions, which can be either hand-crafted or learned with minimal supervision. System 2 operates as a high-level policy applied at sparse temporal intervals, while System 1 acts as a low-level policy at dense temporal intervals. This hierarchical decoupling enables flexible, scalable, and generalizable robot control under both unsupervised and supervised settings, bridging the gap between language, motion, and action. Check out https://kahnchana.github.io/LangToMo for visualizations.
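The two-rate structure described above (System 2 at sparse intervals, System 1 at dense intervals) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `run_hierarchical_policy`, its callback parameters, the `replan_every` interval, and the toy one-dimensional "frame" are all hypothetical names and stand-ins chosen for illustration. In the real system, `forecast_motion` would be the diffusion model and `motion_to_action` the hand-crafted or learned mapping.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]   # stand-in for a camera image (assumption)
Motion = Sequence[float]  # stand-in for a pixel motion forecast (assumption)
Action = Sequence[float]  # stand-in for a robot action (assumption)


def run_hierarchical_policy(
    get_frame: Callable[[], Frame],
    forecast_motion: Callable[[Frame, str], Motion],   # plays the role of System 2
    motion_to_action: Callable[[Motion, Frame], Action],  # plays the role of System 1
    apply_action: Callable[[Action], None],
    instruction: str,
    total_steps: int,
    replan_every: int,
) -> List[Tuple[str, int]]:
    """Run a dual-rate control loop and return a log of which system ran when."""
    log: List[Tuple[str, int]] = []
    motion: Motion = []
    for t in range(total_steps):
        frame = get_frame()
        # Sparse interval: refresh the motion forecast only every `replan_every` steps.
        if t % replan_every == 0:
            motion = forecast_motion(frame, instruction)
            log.append(("system2", t))
        # Dense interval: map the latest forecast to an action at every step.
        action = motion_to_action(motion, frame)
        apply_action(action)
        log.append(("system1", t))
    return log


# Toy usage with a one-dimensional state instead of images.
state = [0.0]
log = run_hierarchical_policy(
    get_frame=lambda: list(state),
    forecast_motion=lambda frame, text: [1.0 - frame[0]],   # "move toward 1.0"
    motion_to_action=lambda motion, frame: [0.1 * motion[0]],
    apply_action=lambda a: state.__setitem__(0, state[0] + a[0]),
    instruction="move right",
    total_steps=6,
    replan_every=3,
)
```

In this sketch the forecast is reused between replanning steps, which is one simple way to realize the sparse/dense split; the interval and the reuse strategy are assumptions, not details stated in the abstract.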