IRASim: A Fine-Grained World Model for Robot Manipulation

Abstract

World models allow autonomous agents to plan and explore by predicting the visual outcomes of different actions. However, for robot manipulation, it is challenging to accurately model fine-grained robot-object interaction within the visual space using existing methods, which overlook the precise alignment between each action and the corresponding frame. In this paper, we present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details, conditioned on historical observations and robot action trajectories. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment. Extensive experiments show that: (1) the quality of the videos generated by our method surpasses that of all baseline methods and scales effectively with increased model size and computation; (2) policy evaluations using IRASim exhibit a strong correlation with those using the ground-truth simulator, highlighting its potential to accelerate real-world policy evaluation; (3) test-time scaling through model-based planning with IRASim significantly enhances policy performance, as evidenced by an improvement in the IoU metric on the Push-T benchmark from 0.637 to 0.961; (4) IRASim provides flexible action controllability, allowing virtual robotic arms in datasets to be controlled via a keyboard or VR controller.
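
To make the planning result in (3) concrete, the following is a minimal sketch of how candidate actions could be ranked by the IoU of their predicted outcomes. It is not the authors' planner: the best-of-N selection rule, the names `iou`, `plan_best_of_n`, and `toy_predict_mask`, and the grid-mask representation are all assumptions made for illustration. In particular, `toy_predict_mask` is a hypothetical stand-in for a learned world model such as IRASim, which would predict video frames rather than a binary grid.

```python
from typing import Callable, List, Sequence, Tuple

Mask = List[List[int]]  # binary occupancy grid, 1 = cell covered by the object


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (an assumed convention).
    return inter / union if union else 1.0


def plan_best_of_n(
    candidates: Sequence[int],
    predict_mask: Callable[[int], Mask],
    target: Mask,
) -> Tuple[int, float]:
    """Score every candidate action by the IoU of its predicted outcome
    against the target and return the best action with its score."""
    scores = [iou(predict_mask(action), target) for action in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]


def block_mask(col: int, size: int = 4) -> Mask:
    """A 2x2 block occupying rows 1-2 and columns col..col+1 of a size x size grid."""
    return [
        [1 if (1 <= r <= 2 and col <= c <= col + 1) else 0 for c in range(size)]
        for r in range(size)
    ]


def toy_predict_mask(action: int) -> Mask:
    """Hypothetical world model: the action is the column the block ends up in."""
    return block_mask(action)


target_mask = block_mask(1)
best_action, best_score = plan_best_of_n([0, 1, 2], toy_predict_mask, target_mask)
print(best_action, best_score)
```

In this toy setting the candidate whose predicted block coincides with the target receives an IoU of 1.0 and is selected. With a learned world model the predictions are imperfect, so the score of the selected action is only an estimate of the outcome in the real environment.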
