We present a method for estimating neural scene representations of objects given only a single image. The core of our method is the estimation of a geometric scaffold for the object and its use as a guide for the reconstruction of the underlying radiance field. Our formulation is based on a generative process that first maps a latent code to a voxelized shape and then renders it to an image, with the object's appearance controlled by a second latent code. During inference, we optimize both the latent codes and the networks to fit a test image of a new object. The explicit disentanglement of shape and appearance allows our model to be fine-tuned given a single image. We can then render new views that are geometrically consistent and faithfully represent the input object. Additionally, our method generalizes to images outside the training domain (more realistic renderings and even real photographs). Finally, the inferred geometric scaffold is itself an accurate estimate of the object's 3D shape. We demonstrate the effectiveness of our approach in several experiments on both synthetic and real images.
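
The two-stage generative process and the test-time fitting described above can be written schematically as follows. This is only a notational sketch: the symbols are assumptions introduced here for illustration and are not taken from the text. They are a shape code $z_s$, an appearance code $z_a$, a shape network $G_\theta$, a renderer $R_\phi$, a camera pose $\pi$, and an unspecified image reconstruction loss $\mathcal{L}$.

```latex
% Generative process (assumed notation): shape code -> voxel scaffold -> image
V = G_\theta(z_s), \qquad \hat{I} = R_\phi(V, z_a, \pi)

% Test-time fitting to a single image I of a new object:
% both latent codes and the network weights are optimized
(z_s^\ast, z_a^\ast, \theta^\ast, \phi^\ast)
  = \arg\min_{z_s,\, z_a,\, \theta,\, \phi}\;
    \mathcal{L}\bigl( R_\phi\bigl(G_\theta(z_s),\, z_a,\, \pi\bigr),\; I \bigr)

% Novel view under a new pose \pi' (assumed), reusing the fitted scaffold
\hat{I}' = R_{\phi^\ast}\bigl(G_{\theta^\ast}(z_s^\ast),\, z_a^\ast,\, \pi'\bigr)
```

Under this reading, the voxelized scaffold $V$ depends only on $z_s$, so every rendered view shares the same geometry, while $z_a$ affects appearance alone.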