Abstract
Large-scale pretraining and instruction tuning have been successful for training general-purpose language models with broad competencies. However, extending to general-purpose vision-language models is challenging due to the distributional diversity in visual inputs. A recent line of work explores vision-language instruction tuning, taking inspiration from the Query Transformer (QFormer) approach proposed in BLIP-2 models for bridging frozen modalities. However, these approaches rely heavily on large-scale multi-modal pretraining for representation learning before eventual finetuning, incurring a huge computational overhead, poor scaling, and limited accessibility. To that end, we propose a more efficient method for QFormer-based vision-language alignment and demonstrate the effectiveness of our strategy compared to existing baselines in improving the efficiency of vision-language pretraining.