Abstract
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, like feature grids or point cloud sequences, to achieve high-quality rendering results. However, they are typically limited to short (1-2 s) video clips and often suffer from large memory footprints when dealing with longer videos. To solve this issue, we propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. Our key observation is that there are generally various degrees of temporal redundancy in dynamic scenes, which consist of areas changing at different speeds. Motivated by this, our approach builds a multi-level hierarchy of 4D Gaussian primitives, where each level separately describes scene regions with different degrees of content change, and adaptively shares Gaussian primitives to represent unchanged scene content over different temporal segments, thus effectively reducing the number of Gaussian primitives. In addition, the tree-like structure of the Gaussian hierarchy allows us to efficiently represent the scene at a particular moment with a subset of Gaussian primitives, leading to nearly constant GPU memory usage during training or rendering regardless of the video length. Extensive experimental results demonstrate the superiority of our method over alternative methods in terms of training cost, rendering speed, and storage usage. To our knowledge, this work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality. Our project page is available at: https://zju3dv.github.io/longvolcap.
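
To make the idea of a temporal hierarchy concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It rests on several assumptions that the abstract does not state: each level splits the video into equal-length segments, segment length halves from one level to the next, slowly changing content is stored at the coarser levels, and Gaussian primitives are stood in for by plain string identifiers. The names `Segment`, `TemporalHierarchy`, `add`, and `active` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: int  # first frame covered (inclusive)
    end: int    # end of the covered range (exclusive)
    gaussian_ids: list = field(default_factory=list)


class TemporalHierarchy:
    """Toy multi-level temporal hierarchy (assumed layout, see lead-in)."""

    def __init__(self, num_frames: int, num_levels: int, root_len: int):
        self.levels = []
        for level in range(num_levels):
            # Assumption: segment length halves at each finer level.
            seg_len = max(1, root_len >> level)
            count = math.ceil(num_frames / seg_len)
            segments = [
                Segment(i * seg_len, min((i + 1) * seg_len, num_frames))
                for i in range(count)
            ]
            self.levels.append((seg_len, segments))

    def add(self, level: int, frame: int, gaussian_id: str) -> None:
        # Store a primitive in the segment of `level` that contains `frame`.
        seg_len, segments = self.levels[level]
        segments[frame // seg_len].gaussian_ids.append(gaussian_id)

    def active(self, frame: int) -> list:
        # One segment per level covers a given frame, so the lookup touches
        # num_levels segments regardless of the total video length.
        ids = []
        for seg_len, segments in self.levels:
            ids.extend(segments[frame // seg_len].gaussian_ids)
        return ids


hierarchy = TemporalHierarchy(num_frames=8, num_levels=3, root_len=8)
hierarchy.add(0, 0, "static_wall")    # shared across all 8 frames
hierarchy.add(1, 5, "slow_curtain")   # shared across frames 4..7
hierarchy.add(2, 1, "fast_hand_a")    # valid for frames 0..1 only
hierarchy.add(2, 6, "fast_hand_b")    # valid for frames 6..7 only
print(hierarchy.active(6))
```

In this toy layout, a primitive stored at a coarse level is shared by every frame in its long segment, while querying a frame only gathers one segment per level. This mirrors, under the stated assumptions, why the number of primitives needed at a particular moment need not grow with video length.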