COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation

Abstract

Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.
