STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

Abstract

We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
