PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Abstract

Large Vision-Language Models (VLMs) have been extended to understand both images and videos. Visual token compression is leveraged to reduce the considerable token length of visual inputs. To meet the needs of different tasks, existing high-performance models usually process images and videos separately with different token compression strategies, which limits their ability to handle images and videos together. To this end, we extend each image into a "static" video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), where the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames. Video tokens are efficiently compressed by exploiting the inherent temporal redundancy. Images are repeated as static videos, and the spatial details can be gradually supplemented across multiple frames. PVC unifies the token compression of images and videos. With a limited number of tokens per frame (64 tokens by default), spatial details and temporal changes can still be preserved. Experiments show that our model achieves state-of-the-art performance across various video understanding benchmarks, including long video tasks and fine-grained short video tasks. Meanwhile, our unified token compression strategy incurs no performance loss on image benchmarks, particularly in detail-sensitive tasks.
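As a rough illustration of the token accounting implied above, the following minimal sketch treats an image as a "static" video by repeating it and computes the resulting visual token count at 64 tokens per frame. The helper names `as_static_video` and `visual_token_budget`, and the repeat count of 4, are hypothetical and chosen only for illustration. The sketch does not implement the progressive encoding or adaptive compression itself.

```python
from typing import Any, List

# Default per-frame token budget stated in the abstract.
TOKENS_PER_FRAME = 64


def as_static_video(image: Any, num_repeats: int) -> List[Any]:
    """Repeat one image so it can be handled like a video (hypothetical helper)."""
    if num_repeats < 1:
        raise ValueError("num_repeats must be at least 1")
    return [image] * num_repeats


def visual_token_budget(num_frames: int,
                        tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Total visual tokens when every frame is compressed to a fixed budget."""
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    image = "image-placeholder"  # stands in for real pixel data
    # The repeat count of 4 is an assumption, not a value taken from the abstract.
    frames = as_static_video(image, num_repeats=4)
    print(len(frames), visual_token_budget(len(frames)))
```

Under these assumptions, an image and a video of the same frame count consume the same number of visual tokens, which is the sense in which a single compression strategy can serve both input types.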
