xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Abstract

We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of a 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
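
To make the idea of a temporal encoder concrete, the sketch below pools the tokens of several frames into a fixed set of 32 output tokens using attention with a set of query vectors. It is a minimal illustration of one plausible form of learnable spatio-temporal pooling, not the authors' implementation: the function names, the random stand-in queries (which would be learned in a real model), and all sizes other than the 32 output tokens are assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_tokens, queries):
    """Pool per-frame tokens into len(queries) output tokens.

    frame_tokens: list of frames, each a list of token vectors (lists of floats).
    queries: list of query vectors; learned in a real model, random stand-ins here.
    """
    # Flatten the (frames, tokens) grid into one spatio-temporal sequence.
    sequence = [tok for frame in frame_tokens for tok in frame]
    dim = len(sequence[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score of this query against every input token.
        scores = [sum(a * b for a, b in zip(q, tok)) / math.sqrt(dim) for tok in sequence]
        weights = softmax(scores)
        # Each output token is an attention-weighted average of all input tokens.
        pooled.append([sum(w * tok[d] for w, tok in zip(weights, sequence)) for d in range(dim)])
    return pooled


rng = random.Random(0)
# Hypothetical sizes for illustration; only the 32 output tokens come from the abstract.
NUM_FRAMES, TOKENS_PER_FRAME, DIM, NUM_OUT = 8, 16, 4, 32

frames = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(TOKENS_PER_FRAME)]
          for _ in range(NUM_FRAMES)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_OUT)]

video_tokens = attention_pool(frames, queries)
print(len(video_tokens), len(video_tokens[0]))  # 32 4
```

The number of output tokens depends only on the number of queries, so the language model receives the same 32-token budget regardless of how many frames or per-frame tokens are fed in.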
