VGGT: Visual Geometry Grounded Transformer

Abstract

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.
