CoMemo: LVLMs Need Image Context with Image Memory

Abstract

Recent advancements in Large Vision-Language Models built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo, a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. Project page is available at https://lalbj.github.io/projects/CoMemo/.
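
The abstract does not spell out how RoPE-DHR works. As a rough illustration of what "thumbnail-based positional aggregation" could look like, the sketch below assumes that a high-resolution image is split into a grid of tiles, that a low-resolution thumbnail of the whole image is also encoded, and that each tile patch reuses the position ID of the thumbnail patch covering the same spatial region. The function name, its parameters, and the row-major ID layout are all hypothetical and are not taken from the paper.

```python
from typing import List


def thumbnail_position_ids(
    tile_rows: int,
    tile_cols: int,
    patches_per_tile_side: int,
    thumb_side: int,
) -> List[List[int]]:
    """Map every high-resolution patch to a thumbnail position ID.

    Assumption (not from the source): the thumbnail is a square grid of
    thumb_side x thumb_side patches numbered row-major from 0, and each
    high-resolution patch inherits the ID of the thumbnail patch that
    covers the same region of the image.
    """
    height = tile_rows * patches_per_tile_side  # patch rows in the full image
    width = tile_cols * patches_per_tile_side   # patch columns in the full image

    ids: List[List[int]] = []
    for gy in range(height):
        row: List[int] = []
        # Thumbnail row that covers this high-resolution patch row.
        ty = gy * thumb_side // height
        for gx in range(width):
            # Thumbnail column that covers this high-resolution patch column.
            tx = gx * thumb_side // width
            row.append(ty * thumb_side + tx)
        ids.append(row)
    return ids


if __name__ == "__main__":
    # 2x2 tiles, 2x2 patches per tile, 2x2 thumbnail.
    for row in thumbnail_position_ids(2, 2, 2, 2):
        print(row)
```

Under these assumptions, the number of distinct position IDs is bounded by the thumbnail size rather than by the total number of high-resolution patches, which is one plausible way to limit the positional distance between image tokens and later text tokens while keeping neighbouring patches in 2D close in position.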
