Abstract
Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require "temporally dense" reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while closely approximating the accuracy of the original foundation model.
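
To make the three ingredients named above concrete, the following is a minimal NumPy sketch, not the ResidualViT implementation. Everything in it is an assumption made for illustration: the random token-dropping policy in `reduce_tokens`, the mean-pooling stand-in `encode` for the frozen pretrained encoder, the linear residual map `W`, and the cosine form of `distillation_loss`. It only shows how a cheap feature for the current frame could combine a reduced token set with a learnable function of the previous frame's feature, and how that feature could be compared against the full model's output.

```python
import numpy as np

def reduce_tokens(tokens, keep_ratio, rng):
    """Keep a random subset of patch tokens (hypothetical reduction policy)."""
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * keep_ratio)))
    idx = np.sort(rng.choice(n, size=n_keep, replace=False))
    return tokens[idx]

def encode(tokens):
    """Stand-in for a frozen pretrained encoder: mean-pool the tokens."""
    return tokens.mean(axis=0)

def residual_feature(tokens, prev_feature, W, keep_ratio, rng):
    """Cheap frame feature: encode the reduced tokens, then add a
    learnable linear projection of the previous frame's feature."""
    kept = reduce_tokens(tokens, keep_ratio, rng)
    return encode(kept) + prev_feature @ W

def distillation_loss(student, teacher):
    """1 - cosine similarity between the cheap and the full feature
    (assumed objective, chosen only for illustration)."""
    denom = np.linalg.norm(student) * np.linalg.norm(teacher) + 1e-8
    return 1.0 - float(student @ teacher) / denom

rng = np.random.default_rng(0)
dim, n_tokens = 8, 196                       # illustrative sizes
prev_tokens = rng.normal(size=(n_tokens, dim))
curr_tokens = prev_tokens + 0.01 * rng.normal(size=(n_tokens, dim))

prev_feature = encode(prev_tokens)           # full-cost feature of the previous frame
teacher = encode(curr_tokens)                # full-cost feature of the current frame
W = np.zeros((dim, dim))                     # learnable; zero-initialised here

kept = reduce_tokens(curr_tokens, 0.4, rng)
student = residual_feature(curr_tokens, prev_feature, W, 0.4, rng)
loss = distillation_loss(student, teacher)
```

In a training loop, `W` would be the only quantity updated to reduce `loss`, with the encoder weights left untouched; that mirrors the weight reuse described above only in spirit.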