Abstract
World models allow autonomous agents to plan and explore by predicting the visual outcomes of different actions. However, for robot manipulation, it is challenging to accurately model fine-grained robot-object interaction within the visual space using existing methods, which overlook precise alignment between each action and the corresponding frame. In this paper, we present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details, conditioned on historical observations and robot action trajectories. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment. Extensive experiments show that: (1) the quality of the videos generated by our method surpasses all the baseline methods and scales effectively with increased model size and computation; (2) policy evaluations using IRASim exhibit a strong correlation with those using the ground-truth simulator, highlighting its potential to accelerate real-world policy evaluation; (3) test-time scaling through model-based planning with IRASim significantly enhances policy performance, as evidenced by an improvement in the IoU metric on the Push-T benchmark from 0.637 to 0.961; (4) IRASim provides flexible action controllability, allowing virtual robotic arms in datasets to be controlled via a keyboard or VR controller.