Abstract
We propose MVSplat, an efficient feed-forward 3D Gaussian Splatting model learned from sparse multi-view images. To accurately localize the Gaussian centers, we propose to build a cost volume representation via plane sweeping in the 3D space, where the cross-view feature similarities stored in the cost volume can provide valuable geometry cues to the estimation of depth. We learn the Gaussian primitives' opacities, covariances, and spherical harmonics coefficients jointly with the Gaussian centers while only relying on photometric supervision. We demonstrate the importance of the cost volume representation in learning feed-forward Gaussian Splatting models via extensive experimental evaluations. On the large-scale RealEstate10K and ACID benchmarks, our model achieves state-of-the-art performance with the fastest feed-forward inference speed (22 fps). Compared to the latest state-of-the-art method pixelSplat, our model uses $10\times$ fewer parameters and infers more than $2\times$ faster while providing higher appearance and geometry quality as well as better cross-dataset generalization.
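To make the cost volume idea concrete, the following is a minimal sketch of plane sweeping on a single scanline of a rectified stereo pair. It is an illustration under simplifying assumptions, not the authors' implementation: the function name `plane_sweep_depth`, the `focal_times_baseline` parameter, the rectified two-view setup, the nearest-pixel lookup, and the softmax-weighted depth estimate are all hypothetical choices made for this example. The abstract states only that cross-view feature similarities are stored in a cost volume and used as cues for depth.

```python
import math

# Score given to depth candidates that project outside the source view
# (an assumption of this sketch; large and negative so softmax ignores them).
OUT_OF_VIEW = -1e9


def plane_sweep_depth(ref_feats, src_feats, focal_times_baseline, depth_candidates):
    """Estimate one depth per reference pixel on a rectified scanline.

    ref_feats, src_feats: one feature vector (list of floats) per pixel.
    focal_times_baseline: focal length times baseline, so disparity = f*b / depth.
    depth_candidates: depths of the swept fronto-parallel planes.
    """
    width = len(ref_feats)
    channels = len(ref_feats[0])
    depths = []
    for x in range(width):
        # One cost-volume column: a similarity score per depth candidate.
        scores = []
        for d in depth_candidates:
            # In rectified stereo, a point at depth d shifts by f*b/d pixels.
            xs = x - round(focal_times_baseline / d)
            if 0 <= xs < width:
                dot = sum(a * b for a, b in zip(ref_feats[x], src_feats[xs]))
                scores.append(dot / math.sqrt(channels))
            else:
                scores.append(OUT_OF_VIEW)
        # Softmax over candidates, then the probability-weighted mean depth.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        depths.append(sum(w * d for w, d in zip(weights, depth_candidates)) / total)
    return depths
```

In this sketch the candidate whose shifted source feature best matches the reference feature dominates the softmax, so the estimated depth concentrates on the plane where the two views agree. A multi-view setting with general camera poses would replace the disparity shift with a homography or projection-based warp.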