Abstract
Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding, but these models often generate English responses regardless of the input language. This phenomenon, termed Image-induced Fidelity Loss (IFL), stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model's original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degrading visual performance. We also explore model merging, which improves language fidelity but comes at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.