Abstract
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.