Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

Abstract

The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.
