Abstract
Developing embodied AI for intelligent surgical systems requires safe, controllable environments for continual learning and evaluation. However, safety regulations and operational constraints in operating rooms (ORs) prevent embodied agents from freely perceiving and interacting in realistic settings. Digital twins provide high-fidelity, risk-free environments for exploration and training, but how to create photorealistic and dynamic digital representations of ORs that capture the relevant spatial, visual, and behavioral complexity remains unclear. We introduce TwinOR, a framework for constructing photorealistic, dynamic digital twins of ORs for embodied AI research. The system reconstructs static geometry from pre-scan videos and continuously models human and equipment motion through multi-view perception of OR activities. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter-level accuracy while preserving dynamic interaction across surgical workflows, enabling realistic renderings and a virtual playground for embodied AI systems. In our experiments, TwinOR simulates stereo and monocular sensor streams for geometry understanding and visual localization tasks. Models such as FoundationStereo and ORB-SLAM3, run on TwinOR-synthesized data, achieve performance within their reported accuracy on real indoor datasets, demonstrating that TwinOR provides sensor-level realism sufficient for perception and localization challenges. By establishing a real-to-sim pipeline for constructing dynamic, photorealistic digital twins of OR environments, TwinOR enables the safe, scalable, and data-efficient development and benchmarking of embodied AI, ultimately accelerating its sim-to-real deployment.