Abstract
Sparse algorithms offer great flexibility for multi-view temporal perception tasks. In this paper, we present an enhanced version of Sparse4D, in which we improve the temporal fusion module by implementing a recursive form of multi-frame feature sampling. By effectively decoupling image features and structured anchor features, Sparse4D enables a highly efficient transformation of temporal features, thereby facilitating temporal fusion solely through the frame-by-frame transmission of sparse features. The recurrent temporal fusion approach provides two main benefits. Firstly, it reduces the computational complexity of temporal fusion from $O(T)$ to $O(1)$, resulting in significant improvements in inference speed and memory usage. Secondly, it enables the fusion of long-term information, leading to more pronounced performance improvements due to temporal fusion. Our proposed approach, Sparse4Dv2, further enhances the performance of the sparse perception algorithm and achieves state-of-the-art results on the nuScenes 3D detection benchmark. Code will be available at \url{https://github.com/linxuewu/Sparse4D}.
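
The complexity claim above can be illustrated with a toy sketch. It is not the Sparse4Dv2 implementation: the names `SampleCounter`, `windowed_fusion`, and `recurrent_fusion` are hypothetical, a plain average and an exponential moving average stand in for the actual fusion modules, and each frame is reduced to a short list of floats. The sketch only counts how often features are sampled per time step when a window of $T$ frames is re-sampled, versus when a single carried state is updated from the current frame.

```python
from collections import deque


class SampleCounter:
    """Stand-in for image-feature sampling; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def sample(self, frame):
        self.calls += 1
        return list(frame)  # hypothetical: one feature vector per frame


def windowed_fusion(frames, window, sampler):
    """Re-sample the last `window` frames at every step: O(T) samples per step."""
    history = deque(maxlen=window)
    outputs = []
    for frame in frames:
        history.append(frame)
        sampled = [sampler.sample(f) for f in history]
        # Assumption: a plain mean replaces the real fusion module.
        fused = [sum(vals) / len(sampled) for vals in zip(*sampled)]
        outputs.append(fused)
    return outputs


def recurrent_fusion(frames, momentum, sampler):
    """Sample only the current frame and merge it into a carried state: O(1) per step."""
    state = None
    outputs = []
    for frame in frames:
        current = sampler.sample(frame)
        if state is None:
            state = current
        else:
            # Assumption: an exponential moving average replaces the real fusion module.
            state = [momentum * s + (1.0 - momentum) * c for s, c in zip(state, current)]
        outputs.append(state)
    return outputs


frames = [[float(t), float(2 * t)] for t in range(10)]
windowed_sampler, recurrent_sampler = SampleCounter(), SampleCounter()
windowed_out = windowed_fusion(frames, window=4, sampler=windowed_sampler)
recurrent_out = recurrent_fusion(frames, momentum=0.5, sampler=recurrent_sampler)
print(windowed_sampler.calls, recurrent_sampler.calls)  # prints "34 10"
```

With ten frames and a window of four, the windowed variant samples 34 times in total while the recurrent variant samples once per frame. The recurrent state also retains a decaying trace of every earlier frame, which loosely mirrors the long-term fusion benefit described above.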