MLAN: Language-Based Instruction Tuning Improves Zero-Shot Generalization of Multimodal Large Language Models

Abstract

We present a novel instruction tuning recipe to improve the zero-shot task generalization of multimodal large language models. In contrast to existing instruction tuning mechanisms that heavily rely on visual instructions, our approach focuses on language-based instruction tuning, offering a distinct and more training-efficient path for multimodal instruction tuning. We evaluate the performance of the proposed approach on 9 unseen datasets across both language and vision modalities. Our results show that our language-only instruction tuning is able to significantly improve the performance of two pretrained multimodal models based on Llama 2 and Vicuna on those unseen datasets. Interestingly, the language instruction following ability also helps unlock the models to follow vision instructions without explicit training. Compared to the state-of-the-art multimodal instruction tuning approaches that are mainly based on visual instructions, our language-based method not only achieves superior performance but also significantly enhances training efficiency. For instance, the language-only instruction tuning produces competitive average performance across the evaluated datasets (with even better performance on language datasets) with significant training efficiency improvements (on average 4x), thanks to the striking reduction in the need for vision data. With a small number of visual instructions, this emerging language instruction following ability transfers well to the unseen vision datasets, outperforming the state of the art with greater training efficiency.
