Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image

Abstract

In this paper, we propose Flash3D, a method for scene reconstruction and novel view synthesis from a single image which is both very generalisable and efficient. For generalisability, we start from a "foundation" model for monocular depth estimation and extend it to a full 3D shape and appearance reconstructor. For efficiency, we base this extension on feed-forward Gaussian Splatting. Specifically, we predict a first layer of 3D Gaussians at the predicted depth, and then add additional layers of Gaussians that are offset in space, allowing the model to complete the reconstruction behind occlusions and truncations. Flash3D is very efficient, trainable on a single GPU in a day, and thus accessible to most researchers. It achieves state-of-the-art results when trained and tested on RealEstate10k. When transferred to unseen datasets like NYU it outperforms competitors by a large margin. More impressively, when transferred to KITTI, Flash3D achieves better PSNR than methods trained specifically on that dataset. In some instances, it even outperforms recent methods that use multiple views as input. Code, models, demo, and more results are available at https://www.robots.ox.ac.uk/~vgg/research/flash3d/.
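The layered construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the choice to push each extra layer back along the camera ray by a cumulative non-negative depth offset are all assumptions made for illustration. The sketch covers only the Gaussian centres, not opacity, covariance, or colour.

```python
import numpy as np


def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map to (H, W, 3) camera-space points.

    Assumes a simple pinhole camera; the intrinsics are hypothetical inputs.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def layered_gaussian_means(depth, depth_offsets, fx, fy, cx, cy):
    """Return (K + 1, H, W, 3) Gaussian centres for K extra layers.

    Layer 0 sits at the predicted depth. Each further layer is placed
    behind the previous one by a non-negative per-pixel offset
    (an assumption about how "offset in space" could be realised).
    """
    layers = [unproject_depth(depth, fx, fy, cx, cy)]
    current = depth
    for delta in depth_offsets:
        current = current + np.maximum(delta, 0.0)  # never move in front
        layers.append(unproject_depth(current, fx, fy, cx, cy))
    return np.stack(layers, axis=0)


# Toy usage: a flat 2 x 3 depth map with one extra layer 0.5 units behind it.
depth = np.full((2, 3), 2.0)
offsets = [np.full((2, 3), 0.5)]
means = layered_gaussian_means(depth, offsets, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
```

In this toy case `means` has one set of centres on the visible surface and a second set directly behind it along each pixel's ray, which is where content hidden by occlusion could be placed.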
