Abstract
We introduce GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generative model that enables both unconditional and conditional generation of 3D scenes. Our model generalizes previous works that focus on single objects by removing the assumption that the camera pose distribution can be shared across samples. We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets and allows for conditional generation of 3D scenes given conditioning variables like sparse image observations or text that describes the scene.