Quantized Visual Geometry Grounded Transformer

Abstract

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes QuantVGGT, the first quantization framework for VGGTs, which relies on two main technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT delivers a 3.7$\times$ memory reduction and a 2.5$\times$ acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98\% of its full-precision counterpart. This demonstrates the advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
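
To make the role of the Hadamard rotation more concrete, the following is a minimal, self-contained sketch of why rotating an activation vector before quantization can help with heavy-tailed values. It is not the QuantVGGT implementation: the helper names (`fwht`, `fake_quantize`, `l2_error`), the toy activation vector, and the symmetric per-tensor 4-bit quantizer with a max-abs scale are all assumptions chosen for illustration, and the channel-smoothing step and the calibration sampling described above are omitted entirely.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (hypothetical helper).

    The length must be a power of two. With the 1/sqrt(n) normalization the
    transform is orthogonal and its own inverse.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def fake_quantize(values, bits=4):
    """Symmetric per-tensor quantize-dequantize with a max-abs scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    scale = peak / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def l2_error(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy heavy-tailed activation: unit-magnitude entries plus one large outlier.
x = [1.0 if i % 3 else -1.0 for i in range(16)]
x[3] = 20.0

# Baseline: quantize directly. The outlier sets the scale, so the
# unit-magnitude entries all round to zero.
direct = fake_quantize(x)

# Rotate, quantize in the rotated basis, then rotate back.
x_rot = fwht(x)
restored = fwht(fake_quantize(x_rot))

peak_direct = max(abs(v) for v in x)
peak_rotated = max(abs(v) for v in x_rot)
err_direct = l2_error(x, direct)
err_rotated = l2_error(x, restored)

print("peak before / after rotation:", peak_direct, peak_rotated)
print("L2 error direct / rotated:", err_direct, err_rotated)
```

In this toy case the rotation spreads the outlier's energy across all 16 coordinates, so the max-abs scale shrinks and the remaining entries are no longer rounded away. In an actual linear layer the rotation would typically be folded into the weights as well, so that the layer output is unchanged in full precision; that step is not shown here.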
