CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Abstract

A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behaviors and inability to construct words that correctly represent visual entities and the relations among the entities. To this end, we propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text and dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, for dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose the visual entities and relationships through the communication tokens. The vision-to-language and language-to-vision communication are iteratively performed until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% on ARO top-1 accuracy). We also achieve state-of-the-art performance on traditional vision-language tasks such as referring expression comprehension and visual question answering.
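The decoding loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the token names `<comm>` and `<eos>`, the `Box` format, and the functions `communicative_decode`, `scripted_next_token`, and `toy_propose_regions` are all hypothetical stand-ins. A scripted token list plays the role of the LLM and a trivial function plays the role of the detection network, so that only the control flow (generate, emit a communication token, receive ROIs, continue) is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical ROI format: (x1, y1, x2, y2). The abstract does not specify one.
Box = Tuple[float, float, float, float]

# Hypothetical token names; the abstract only says "communication tokens".
COMM_TOKEN = "<comm>"
EOS_TOKEN = "<eos>"


@dataclass
class DecodeState:
    tokens: List[str] = field(default_factory=list)      # sentence generated so far
    rois: List[List[Box]] = field(default_factory=list)  # one ROI list per communication token


def communicative_decode(
    next_token: Callable[[List[str], List[List[Box]]], str],
    propose_regions: Callable[[List[str]], List[Box]],
    max_steps: int = 32,
) -> DecodeState:
    """Alternate language generation with region proposals until end of sentence."""
    state = DecodeState()
    for _ in range(max_steps):
        # Language step: the next token is conditioned on the text and all ROIs so far.
        token = next_token(state.tokens, state.rois)
        if token == EOS_TOKEN:
            break
        state.tokens.append(token)
        if token == COMM_TOKEN:
            # Language-to-vision step: ask the detector for regions relevant
            # to the sentence generated so far, then feed them back.
            state.rois.append(propose_regions(state.tokens))
    return state


# Toy stand-in for the LLM: replays a fixed script and ignores the ROIs.
SCRIPT = ["a", "man", COMM_TOKEN, "riding", COMM_TOKEN, "a", "horse", EOS_TOKEN]


def scripted_next_token(tokens: List[str], rois: List[List[Box]]) -> str:
    return SCRIPT[len(tokens)]


# Toy stand-in for the detection network: returns one made-up box per call.
def toy_propose_regions(tokens: List[str]) -> List[Box]:
    n = tokens.count(COMM_TOKEN)
    return [(float(n), 0.0, float(n) + 1.0, 1.0)]


result = communicative_decode(scripted_next_token, toy_propose_regions)
print(" ".join(result.tokens))
print(result.rois)
```

In this sketch a communication token follows the entity "man" and the relation "riding", mirroring the abstract's statement that the token is generated after a visual entity or a relation. A real system would replace both stand-ins with model calls.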
