Abstract
We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.
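To make the modality-aligned objective concrete, the following is a minimal sketch rather than the model's actual training code. It assumes a trajectory is stored as parallel lists of observation and action vectors, that the prediction made at an observation token is scored against the next observation and the prediction made at an action token against the next action, and that a trajectory without actions (such as one derived from video) simply contributes no action terms. The function names, the mean-squared-error loss, and the skip-the-term handling of missing actions are all illustrative assumptions.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def modality_aligned_loss(
    pred_obs: List[Vector],
    pred_act: Optional[List[Vector]],
    obs: List[Vector],
    act: Optional[List[Vector]],
) -> float:
    """Hypothetical modality-aligned next-token loss for one trajectory.

    pred_obs[t] is the prediction made at observation token t and is
    compared with obs[t + 1]; pred_act[t] is the prediction made at
    action token t and is compared with act[t + 1].
    If act is None (assumed encoding of a missing modality), the action
    terms are skipped and only observations are supervised.
    """
    terms: List[float] = []
    for t in range(len(obs) - 1):
        # Observation token predicts the next observation token.
        terms.append(mse(pred_obs[t], obs[t + 1]))
        # Action token predicts the next action token, when actions exist.
        if act is not None and pred_act is not None:
            terms.append(mse(pred_act[t], act[t + 1]))
    return sum(terms) / len(terms)


# Toy trajectory with one-dimensional observations and actions.
obs = [[0.0], [1.0], [2.0]]
pred_obs = [[1.0], [2.0]]   # perfect observation predictions
act = [[0.0], [0.0], [0.0]]
pred_act = [[1.0], [1.0]]   # action predictions off by 1.0

loss_full = modality_aligned_loss(pred_obs, pred_act, obs, act)
loss_video_only = modality_aligned_loss(pred_obs, None, obs, None)
```

In this toy case the action-free trajectory is scored on observations alone, so the same loss function can be applied to both complete and action-free data under the stated assumptions.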