One For All: Video Conversation is Feasible Without Video Instruction Tuning

Abstract

The recent progress in Large Language Models (LLMs) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of the LLM and visual backbone, minimal GPU memory is left for facilitating effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder, and only this branch is tuned while the backbone is kept frozen. Pretrained just once, BT-Adapter can be seamlessly integrated into all image conversation models that use the same version of CLIP, enabling video conversations without the need for video instructions. Besides, we develop a unique asymmetric token masking strategy inside the branch with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art results on video chatting using video instruction tuning, outperforming previous SOTAs by a large margin.
