Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation

Abstract

Vision-Language Translation (VLT) is a challenging task that requires accurately recognizing multilingual text embedded in images and translating it into the target language with the support of visual context. While recent Large Vision-Language Models (LVLMs) have demonstrated strong multilingual and visual understanding capabilities, there is a lack of systematic evaluation and understanding of their performance on VLT. In this work, we present a comprehensive study of VLT from three key perspectives: data quality, model architecture, and evaluation metrics. (1) We identify critical limitations in existing datasets, particularly in semantic and cultural fidelity, and introduce AibTrans, a multilingual, parallel, human-verified dataset with OCR-corrected annotations. (2) We benchmark 11 commercial LVLMs/LLMs and 6 state-of-the-art open-source models across end-to-end and cascaded architectures, revealing their OCR dependency and contrasting generation versus reasoning behaviors. (3) We propose Density-Aware Evaluation to address metric reliability issues under varying contextual complexity, introducing the DA Score as a more robust measure of translation quality. Building upon these findings, we establish a new evaluation benchmark for VLT. Notably, we observe that fine-tuning LVLMs on high-resource language pairs degrades cross-lingual performance, and we propose a balanced multilingual fine-tuning strategy that effectively adapts LVLMs to VLT without sacrificing their generalization ability.
