Abstract
We propose a new simulator, training approach, and policy architecture, collectively called SOUS VIDE, for end-to-end visual drone navigation. Our trained policies exhibit zero-shot sim-to-real transfer with robust real-world performance using only onboard perception and computation. Our simulator, called FiGS, couples a computationally simple drone dynamics model with a high visual fidelity Gaussian Splatting scene reconstruction. FiGS can quickly simulate drone flights producing photorealistic images at up to 130 fps. We use FiGS to collect 100k-300k image/state-action pairs from an expert MPC with privileged state and dynamics information, randomized over dynamics parameters and spatial disturbances. We then distill this expert MPC into an end-to-end visuomotor policy with a lightweight neural architecture, called SV-Net. SV-Net processes color image, optical flow and IMU data streams into low-level thrust and body rate commands at 20 Hz onboard a drone. Crucially, SV-Net includes a learned module for low-level control that adapts at runtime to variations in drone dynamics. In a campaign of 105 hardware experiments, we show SOUS VIDE policies to be robust to 30% mass variations, 40 m/s wind gusts, 60% changes in ambient brightness, shifting or removing objects from the scene, and people moving aggressively through the drone's visual field. Code, data, and experiment videos can be found on our project page: https://stanfordmsl.github.io/SousVide/.
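
The data-collection step described above (an expert with privileged information, rolled out under randomized dynamics and spatial disturbances, logged as observation-action pairs) can be illustrated with a toy sketch. This is not the FiGS simulator or the authors' MPC: it assumes a 1D point mass, substitutes a simple PD law for the expert, and every name and parameter below (`expert_action`, `collect_rollout`, `collect_dataset`, the gains, the 30% mass spread used as a randomization range) is hypothetical and chosen only for illustration.

```python
import random

def expert_action(pos, vel, mass, target=1.0, kp=4.0, kd=3.0):
    """Stand-in for the privileged expert: a PD law that knows the true mass.
    (Assumption: the paper uses an MPC; a PD law is swapped in for brevity.)"""
    accel_cmd = kp * (target - pos) - kd * vel
    return mass * accel_cmd  # force needed to achieve the commanded acceleration

def collect_rollout(rng, steps=100, dt=0.05, nominal_mass=1.0, mass_spread=0.3):
    """Simulate one rollout with a randomized mass and a random start offset."""
    mass = nominal_mass * (1.0 + rng.uniform(-mass_spread, mass_spread))
    pos, vel = rng.uniform(-0.5, 0.5), 0.0  # spatial disturbance at the start
    samples = []
    for _ in range(steps):
        force = expert_action(pos, vel, mass)
        # The logged observation omits the mass: a student policy trained on
        # these pairs never sees the privileged parameter directly.
        samples.append(((pos, vel), force))
        vel += (force / mass) * dt  # semi-implicit Euler integration
        pos += vel * dt
    return mass, samples

def collect_dataset(num_rollouts=20, seed=0):
    """Aggregate observation-action pairs over many randomized rollouts."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_rollouts):
        _, samples = collect_rollout(rng)
        dataset.extend(samples)
    return dataset
```

Calling `collect_dataset()` with the defaults yields 2,000 pairs (20 rollouts of 100 steps each). A student policy would then be fit to these pairs by supervised regression; in this toy setting the observation is a (position, velocity) tuple rather than images, optical flow, and IMU data.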