Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model

Abstract

Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
