Inference Optimal VLMs Need Only One Visual Token but Larger Models

Abstract

Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high inference latency, because the LLM must spend substantial compute processing a large number of input tokens, most of which come from the image. To reduce inference costs, one can either downsize the LLM or reduce the number of input image tokens; the latter has been the focus of many recent works on token compression. However, it is unclear what the optimal trade-off is, as both factors directly affect VLM performance. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved when using the largest LLM that fits within the inference budget while minimizing the visual token count, often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., $5-10\times$), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take some initial steps towards building approaches tailored for high token compression settings. Code is available at https://github.com/locuslab/llava-token-compression.
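
To make the trade-off concrete, the sketch below shows how a fixed inference budget couples LLM size and visual token count. It is only an illustration, not the paper's method: it assumes the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per processed token, and the model sizes, the 576-token baseline, and the function names are all hypothetical choices. It also says nothing about downstream error, which is what the scaling laws in the paper model.

```python
# Assumption: forward-pass cost ~ 2 * (LLM parameters) * (tokens processed).
# This is a rule of thumb, not a formula taken from the abstract.

def inference_flops(llm_params: int, visual_tokens: int, text_tokens: int = 0) -> int:
    """Approximate FLOPs for one forward pass over all input tokens."""
    return 2 * llm_params * (visual_tokens + text_tokens)


def max_visual_tokens(budget_flops: int, llm_params: int, text_tokens: int = 0) -> int:
    """Largest visual token count that keeps a model within the FLOP budget."""
    total_tokens = budget_flops // (2 * llm_params)
    return max(total_tokens - text_tokens, 0)


# Hypothetical LLM sizes (parameter counts), chosen only for illustration.
LLM_SIZES = {
    "0.5B": 500_000_000,
    "1.8B": 1_800_000_000,
    "4B": 4_000_000_000,
    "7B": 7_000_000_000,
}

# Budget: what the smallest model spends on an assumed 576 visual tokens.
budget = inference_flops(LLM_SIZES["0.5B"], visual_tokens=576)

for name, params in LLM_SIZES.items():
    print(f"{name}: up to {max_visual_tokens(budget, params)} visual tokens")
```

Under this approximation, every increase in LLM size at a fixed budget has to be paid for with proportionally fewer visual tokens, which is why the question of how far tokens can be compressed becomes central.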
