We propose NeRF-VAE, a 3D scene generative model that incorporates geometric structure via NeRF and differentiable volume rendering. In contrast to NeRF, our model takes into account shared structure across scenes and, using amortized inference, is able to infer the structure of a novel scene without the need to re-train. NeRF-VAE's explicit 3D rendering process further sets it apart from previous generative models, whose convolution-based rendering lacks geometric structure. Our model is a VAE that learns a distribution over radiance fields by conditioning them on a latent scene representation. We show that, once trained, NeRF-VAE is able to infer and render geometrically-consistent scenes from previously unseen 3D environments using very few input images. We further demonstrate that NeRF-VAE generalizes well to out-of-distribution cameras, while convolutional models do not. Finally, we introduce and study an attention-based conditioning mechanism for NeRF-VAE's decoder, which improves model performance.
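As a rough illustration of what rendering a radiance field conditioned on a latent scene representation can look like, the following minimal sketch applies the standard NeRF volume rendering quadrature to a single ray. It is not the model's implementation: `toy_radiance_field` is a hypothetical, hand-written stand-in for the learned scene function, and the layout of the latent `z` (a sphere radius followed by an RGB colour) is an assumption made purely for this example.

```python
import math


def toy_radiance_field(point, z):
    """Hypothetical stand-in for a latent-conditioned scene function.

    Maps a 3D point and a latent code z to (density, rgb). Here
    z = (radius, r, g, b) describes a solid coloured sphere at the origin;
    a learned network would play this role in practice.
    """
    radius, r, g, b = z
    inside = math.sqrt(sum(c * c for c in point)) < radius
    density = 10.0 if inside else 0.0
    return density, (r, g, b)


def render_ray(origin, direction, z, near=0.0, far=4.0, n_samples=64):
    """Volume rendering quadrature along one ray.

    Accumulates C = sum_i T_i * (1 - exp(-sigma_i * delta)) * c_i, where
    T_i is the transmittance remaining before sample i.
    Returns (rgb colour, accumulated opacity).
    """
    delta = (far - near) / n_samples
    transmittance = 1.0
    colour = [0.0, 0.0, 0.0]
    for i in range(n_samples):
        t = near + (i + 0.5) * delta  # midpoint of the i-th interval
        point = [o + t * d for o, d in zip(origin, direction)]
        sigma, rgb = toy_radiance_field(point, z)
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        colour = [c + weight * v for c, v in zip(colour, rgb)]
        transmittance *= 1.0 - alpha
    return colour, 1.0 - transmittance


# The same renderer produces different scenes purely by changing z.
red_sphere = (1.0, 1.0, 0.0, 0.0)
hit_colour, hit_opacity = render_ray((0.0, 0.0, -2.0), (0.0, 0.0, 1.0), red_sphere)
miss_colour, miss_opacity = render_ray((3.0, 0.0, -2.0), (0.0, 0.0, 1.0), red_sphere)
```

Because the renderer depends only on the ray's origin and direction, nothing in it is tied to a particular camera pose. This is one intuition for why an explicit 3D rendering process might cope better with unseen cameras than a convolutional decoder.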