### Abstract

We show how to train a fully convolutional neural network to perform inverse rendering from a single, uncontrolled image. The network takes an RGB image as input and regresses albedo and normal maps, from which we compute lighting coefficients. Our network is trained using large uncontrolled image collections without ground truth. By incorporating a differentiable renderer, our network can learn from self-supervision. Since the problem is ill-posed, we introduce additional supervision: 1. We learn a statistical natural illumination prior. 2. Our key insight is to perform offline multiview stereo (MVS) on images containing rich illumination variation. From the MVS pose and depth maps, we can cross-project between overlapping views such that Siamese training can be used to ensure consistent estimation of photometric invariants. MVS depth also provides direct coarse supervision for normal map estimation. We believe this is the first attempt to use MVS supervision for learning inverse rendering.
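As a rough illustration of how lighting coefficients might be computed from estimated albedo and normal maps, the sketch below solves a per-channel linear least-squares problem. It rests on assumptions the abstract does not state: a Lambertian reflectance model, lighting represented by 9 second-order spherical harmonic coefficients per colour channel (with the Lambertian convolution factors absorbed into the coefficients), and flattened per-pixel arrays. The function names `sh_basis` and `estimate_lighting` are hypothetical.

```python
import numpy as np

def sh_basis(normals):
    """Second-order real spherical harmonic basis at unit normals.

    normals: (N, 3) array of unit vectors. Returns an (N, 9) array.
    Assumption: SH lighting is an illustrative choice, not stated in the abstract.
    """
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y,
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,
        1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z,
        0.546274 * (x ** 2 - y ** 2),
    ], axis=1)

def estimate_lighting(image, albedo, normals):
    """Least-squares lighting under the assumed model image = albedo * (B @ L).

    image, albedo: (N, 3) arrays of RGB values; normals: (N, 3) unit vectors.
    Returns a (9, 3) array of lighting coefficients, one column per channel.
    """
    basis = sh_basis(normals)                      # (N, 9)
    coeffs = np.empty((9, 3))
    for c in range(3):
        # Each pixel contributes one linear equation in the 9 unknowns.
        design = albedo[:, c:c + 1] * basis        # (N, 9)
        coeffs[:, c] = np.linalg.lstsq(design, image[:, c], rcond=None)[0]
    return coeffs
```

Because the lighting is obtained in closed form from the two regressed maps under these assumptions, a network would only need to predict albedo and normals; the solve itself is linear and therefore compatible with gradient-based training.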