CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion

Abstract

Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real-time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long-horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context-aware visuomotor policy learning. To further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key-value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real-world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded input observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.
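
To make the caching idea concrete, the following is a minimal sketch of generic key-value caching for causal attention, not the CDP implementation. It assumes a single attention head with identity query/key/value projections, and the names `attend`, `KVCache`, and `full_causal_attention` are hypothetical. It illustrates that appending one key-value pair per timestep and attending over the cache can reproduce the result of recomputing causal attention over the whole history.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Hypothetical cache: stores keys and values from earlier timesteps."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the newest key/value pair is added; older pairs are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_causal_attention(queries, keys, values):
    """Reference: rebuild the attended history from scratch at every timestep."""
    return [attend(queries[t], keys[:t + 1], values[:t + 1])
            for t in range(len(queries))]


if __name__ == "__main__":
    # Assumption: toy 2-D tokens used directly as queries, keys, and values.
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    cache = KVCache()
    incremental = [cache.step(x, x, x) for x in tokens]
    reference = full_causal_attention(tokens, tokens, tokens)
    print(all(abs(a - b) < 1e-12
              for row_a, row_b in zip(incremental, reference)
              for a, b in zip(row_a, row_b)))
```

In a real transformer the saving comes mainly from not re-running the key and value projections (and earlier layers) for past tokens; the sketch omits those projections and only shows that the cached and recomputed outputs agree.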
