Abstract
General-purpose language models that can solve various language-domain tasks have emerged driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models have been open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.