Abstract
Pre-trained language-vision models have shown remarkable performance on the visual question answering (VQA) task. However, most pre-trained models are trained in a monolingual setting, typically for a resource-rich language such as English. Training such models for multilingual setups demands high computing resources and multilingual language-vision datasets, which hinders their application in practice. To alleviate these challenges, we propose a knowledge distillation approach to extend an English language-vision model (teacher) into an equally effective multilingual and code-mixed model (student). Unlike existing knowledge distillation methods, which use only the output from the last layer of the teacher network for distillation, our student model learns and imitates the teacher from multiple intermediate layers (language and vision encoders) with appropriately designed distillation objectives for incremental knowledge extraction. We also create a large-scale multilingual and code-mixed VQA dataset in eleven different language setups covering multiple Indian and European languages. Experimental results and in-depth analysis show the effectiveness of the proposed VQA model over pre-trained language-vision models on eleven diverse language setups.