FlexAttention for Efficient High-Resolution Vision-Language Models

Abstract

Current high-resolution vision-language models (VLMs) encode images as high-resolution image tokens and exhaustively use all of these tokens to compute attention, which significantly increases the computational cost. To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models. Specifically, a high-resolution image is encoded as both high-resolution tokens and low-resolution tokens, and only the low-resolution tokens and a few selected high-resolution tokens are used to calculate the attention map, which greatly reduces the computational cost. The high-resolution tokens are selected by a high-resolution selection module, which retrieves the tokens of relevant regions based on an input attention map. The selected high-resolution tokens are then concatenated with the low-resolution tokens and text tokens and fed to a hierarchical self-attention layer, which produces an attention map that can be used for the next step of high-resolution token selection. The hierarchical self-attention and high-resolution token selection are performed iteratively for each attention layer. Experiments on multimodal benchmarks show that FlexAttention outperforms existing high-resolution VLMs (e.g., relative improvements of ~9% on V* Bench and ~7% on TextVQA), while also reducing the computational cost by nearly 40%.
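The select-then-attend loop described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`select_high_res_tokens`, `hierarchical_attention`, `flex_layers`), the top-k selection rule, the uniform initial attention map, the absence of learned projections, and the choice to let only low-resolution and text tokens act as queries are all assumptions made for the example.

```python
import numpy as np

def select_high_res_tokens(attn_map, high_res, scale, k):
    """Gather the high-res tokens under the k highest-scoring low-res cells.

    attn_map: (h, w) relevance scores over the low-resolution grid.
    high_res: (h*scale, w*scale, d) high-resolution token grid.
    Returns an array of shape (k*scale*scale, d).
    Top-k selection is an assumption; the abstract only says tokens of
    relevant regions are retrieved from an attention map.
    """
    h, w = attn_map.shape
    d = high_res.shape[-1]
    top = np.argsort(attn_map.ravel())[::-1][:k]
    rows, cols = np.unravel_index(top, (h, w))
    blocks = [
        high_res[r * scale:(r + 1) * scale, c * scale:(c + 1) * scale].reshape(-1, d)
        for r, c in zip(rows, cols)
    ]
    return np.concatenate(blocks, axis=0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_attention(hidden, selected_high_res):
    """Attend from hidden tokens to hidden tokens plus selected high-res tokens.

    hidden: (n, d) low-res image tokens followed by text tokens (queries).
    selected_high_res: (m, d) extra keys/values.
    Returns the updated hidden states (n, d) and attention weights (n, n + m).
    No learned Q/K/V projections are used here (simplifying assumption).
    """
    d = hidden.shape[-1]
    kv = np.concatenate([hidden, selected_high_res], axis=0)
    weights = softmax(hidden @ kv.T / np.sqrt(d))
    return weights @ kv, weights

def flex_layers(low_res, text, high_res, scale, k, num_layers):
    """Alternate token selection and attention for num_layers layers."""
    h, w, d = low_res.shape
    n_img = h * w
    hidden = np.concatenate([low_res.reshape(n_img, d), text], axis=0)
    # Uniform map before the first layer (assumption).
    attn_map = np.full((h, w), 1.0 / n_img)
    for _ in range(num_layers):
        selected = select_high_res_tokens(attn_map, high_res, scale, k)
        hidden, weights = hierarchical_attention(hidden, selected)
        # Next map: attention from text queries to low-res image keys,
        # averaged over the text tokens (assumption).
        attn_map = weights[n_img:, :n_img].mean(axis=0).reshape(h, w)
    return hidden, attn_map
```

In this sketch each layer adds only `k * scale * scale` keys and values to the attention computation instead of the full high-resolution grid, which is where the cost saving would come from.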
