ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

Abstract

Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, existing open-source multi-modal models are weak at multi-turn interaction, especially over long contexts. To address this issue, we first introduce a context modeling module, termed ContextQFormer, which uses a memory block to enhance the representation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog, and experimental results show that it achieves an improvement of 2%-4% in available rate over the baselines.
