Matryoshka Query Transformer for Large Vision-Language Models

Abstract

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs struggle to adapt to varying computational constraints. This raises the question: can we make the number of visual tokens flexible to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), which can encode an image into m visual tokens at inference time, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m <= M and train the model using only the first m latent query tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once and can then flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) sacrifices only 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between accuracy and the computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds.
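The token-selection idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`cross_attend`, `sample_num_tokens`, `encode_image`), the uniform sampling of m, the single-head attention without learned projections, and the toy feature dimension are all assumptions made for illustration. Only the core mechanism comes from the abstract: keep the first m of M latent query tokens and let them compress the image features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Toy single-head cross-attention (assumption: no learned projections).

    Each query attends over all image features and returns one output
    vector, so the number of outputs equals the number of queries.
    """
    dim = len(features[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
            for feat in features
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]
        )
    return outputs


def sample_num_tokens(max_tokens, rng):
    """Pick m <= M for one training step (assumption: uniform over 1..M)."""
    return rng.randint(1, max_tokens)


def encode_image(latent_queries, features, m):
    """Compress image features into m visual tokens using the first m queries."""
    if not 1 <= m <= len(latent_queries):
        raise ValueError("m must satisfy 1 <= m <= M")
    # Matryoshka-style selection: keep the first m queries, discard the rest.
    return cross_attend(latent_queries[:m], features)


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 8  # toy feature size, chosen only to keep the demo fast
    features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(576)]
    latent_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(256)]

    # Training-time view: a different m is drawn at each step.
    m = sample_num_tokens(len(latent_queries), rng)
    print(m, len(encode_image(latent_queries, features, m)))

    # Inference-time view: the caller picks m to fit the compute budget.
    print(len(encode_image(latent_queries, features, 16)))
```

In this simplified sketch the queries do not interact with one another, so the tokens produced for a small m are exactly the leading tokens produced for a larger m. Whether that nesting holds exactly in the actual model depends on architectural details the abstract does not state.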
