Abstract
Recent advances in vision-language models have achieved remarkable results in enabling language models to understand visual inputs. However, a unified approach to aligning these models across diverse tasks such as image captioning and visual question answering remains a challenge. Existing methods require either very large language models or very large datasets, which makes inefficient use of existing models. This paper addresses this gap and devises a training strategy for auto-regressive vision-language models that unifies vision-language tasks such as image captioning and visual question answering. We propose four training stages that align the vision model with the language model, giving the language model the ability to process visual inputs. We also devise different attention masks for training transformer-based language models that improve the quality of visual features. Further, we report four findings: 1) the attention mask should not be applied to visual inputs, 2) the language model converges faster on AI-generated data, 3) more work should be done in the alignment stage during pre-training of the model, and 4) the model adapts easily to downstream tasks such as visual question answering on healthcare datasets like PathVQA. After training for one epoch in each stage, the model outperforms larger models such as the 13-billion-parameter VILA model on common benchmarks, measured by CIDEr scores on the COCO and Flickr30k datasets, and achieves scores very close to GIT-2 on the same datasets despite being a much smaller model trained on a much smaller dataset. All training uses available best practices, including multi-GPU parallel training, lower-precision training with 16-bit floating-point numbers, faster attention (SDPA), and gradient accumulation, and completes within 12 hours.
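As an illustration of the first finding, the following is a minimal sketch of one possible reading: visual tokens form an unmasked prefix that every position can attend to, while text tokens keep the usual causal mask. The function name `build_prefix_mask`, the assumption that visual tokens precede text tokens, and the convention that `True` means "may attend" are choices made for this sketch, not details taken from the paper.

```python
def build_prefix_mask(num_visual: int, num_text: int) -> list[list[bool]]:
    """Build a square boolean mask; mask[q][k] is True when query q may attend to key k.

    Assumption for this sketch: the sequence is laid out as
    [visual tokens..., text tokens...].
    """
    total = num_visual + num_text
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_visual:
                # Visual keys are never masked: every position sees all visual tokens.
                mask[q][k] = True
            else:
                # Text keys stay causal: visible only to themselves and later positions.
                mask[q][k] = k <= q
    return mask


if __name__ == "__main__":
    # Print the mask for 2 visual tokens followed by 3 text tokens.
    for row in build_prefix_mask(2, 3):
        print("".join("1" if allowed else "0" for allowed in row))
```

The `True` means "may attend" convention matches the boolean `attn_mask` accepted by PyTorch's `scaled_dot_product_attention`, so a mask built this way could be converted to a tensor and passed there. Whether this layout corresponds to the masks actually used in training is an assumption.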