Diffusion models currently achieve state-of-the-art performance for both conditional and unconditional image generation. However, so far, image diffusion models do not support tasks required for 3D understanding, such as view-consistent 3D generation or single-view object reconstruction. In this paper, we present RenderDiffusion as the first diffusion model for 3D generation and inference that can be trained using only monocular 2D supervision. At the heart of our method is a novel image denoising architecture that generates and renders an intermediate three-dimensional representation of a scene in each denoising step. This enforces a strong inductive structure within the diffusion process, giving us a 3D-consistent representation while requiring only 2D supervision. The resulting 3D representation can be rendered from any viewpoint. We evaluate RenderDiffusion on the ShapeNet and CLEVR datasets and show competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images. Additionally, our diffusion-based approach allows us to use 2D inpainting to edit 3D scenes. We believe that our work promises to enable full 3D generation at scale when trained on massive image collections, thus circumventing the need for large-scale 3D model collections for supervision.
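To make the idea of a denoiser that builds and renders a 3D representation in each step concrete, the following is a minimal toy sketch in NumPy, not the RenderDiffusion implementation. Everything in it is an assumption made for illustration: `encode_to_volume` is a hypothetical stand-in for a learned encoder (it simply copies the image along depth), the scene is a dense RGB-plus-density voxel grid, `render` is a simple orthographic alpha-compositing renderer, and the update rule is a standard deterministic DDIM-style step swapped in because the abstract does not specify a sampler.

```python
import numpy as np

def encode_to_volume(noisy_image, depth=8):
    """Hypothetical stand-in for a learned encoder (an assumption, not the
    paper's network): lift an (H, W, 3) image to an (H, W, D, 4) voxel grid
    holding RGB plus a constant density."""
    h, w, _ = noisy_image.shape
    rgb = np.repeat(noisy_image[:, :, None, :], depth, axis=2)
    density = np.full((h, w, depth, 1), 0.5)
    return np.concatenate([rgb, density], axis=-1)

def render(volume, axis=2):
    """Toy orthographic volume renderer: alpha-composite samples along the
    chosen grid axis. Changing `axis` gives a different viewpoint."""
    vol = np.moveaxis(volume, axis, 2)           # ray direction -> axis 2
    rgb, sigma = vol[..., :3], vol[..., 3]
    alpha = 1.0 - np.exp(-sigma)                 # opacity of each sample
    ones = np.ones_like(alpha[..., :1])
    # Transmittance: fraction of light reaching each sample unoccluded.
    trans = np.cumprod(
        np.concatenate([ones, 1.0 - alpha[..., :-1]], axis=-1), axis=-1
    )
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(axis=2)

def denoise_step(x_t, abar_t, abar_prev):
    """One denoising step: predict the clean image by building a 3D volume
    and rendering it, then apply a deterministic DDIM-style update
    (an assumed sampler). abar_* are cumulative noise-schedule products."""
    volume = encode_to_volume(x_t)
    x0_pred = render(volume, axis=2)             # render from the input view
    eps_pred = (x_t - np.sqrt(abar_t) * x0_pred) / np.sqrt(1.0 - abar_t)
    x_prev = np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred
    return x_prev, volume

# Usage: start from pure noise and iterate; the last volume can then be
# rendered from another axis as a crude "novel view".
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))
abar = np.linspace(0.999, 0.01, 10)              # index 0 = least noisy
for t in range(len(abar) - 1, 0, -1):
    x, volume = denoise_step(x, abar[t], abar[t - 1])
novel_view = render(volume, axis=0)
```

The point of the sketch is the structure of `denoise_step`: the clean-image prediction is obtained only by rendering an explicit 3D volume, so the volume is a by-product of every step and can be rendered from other viewpoints without any 3D supervision signal. A real model would replace the copy-along-depth encoder with a trained network and the orthographic compositing with a camera-aware renderer.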