LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task

Abstract

This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on, such as COCO, Flickr30k, and Flickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from 0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a new state of the art in vision-language retrieval. We share our configurations and results to support further research. Code and pre-trained models are available at https://github.com/aliasgerovs/azclip.
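
For readers unfamiliar with the CLIP architecture mentioned above, the following is a minimal sketch of the symmetric contrastive objective that CLIP-style dual encoders are typically trained with. It is not taken from the authors' repository. The function names, the plain-list embeddings, and the temperature of 0.07 (the initial value used in the original CLIP work) are assumptions made for illustration, and the abstract does not state which loss or temperature LowCLIP uses. In practice the embeddings would come from an image encoder and a text encoder such as Multilingual BERT.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an image-caption similarity matrix.

    Assumes image_embs[i] and text_embs[i] form a matching pair and that
    every other pairing in the batch is a negative (hypothetical setup).
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: cosine similarity of image i and caption j, scaled.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: each row should peak on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: each column should peak on the diagonal.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    captions_aligned = [[1.0, 0.0], [0.0, 1.0]]
    captions_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned:", clip_contrastive_loss(images, captions_aligned))
    print("swapped:", clip_contrastive_loss(images, captions_swapped))
```

The loss is small when each image is most similar to its own caption and large when captions are mismatched, which is the property that makes the learned embedding space usable for retrieval.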
