Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models

Abstract

Cross-lingual transfer enables vision-language models (VLMs) to perform vision tasks in various languages with training data only in one language. Current approaches rely on large pre-trained multilingual language models. However, they face the curse of multilinguality, sacrificing downstream task performance for multilingual capabilities, struggling with lexical ambiguities, and falling behind recent advances. In this work, we study the scaling laws of systematic generalization with monolingual VLMs for multilingual tasks, focusing on the impact of model size and seen training samples. We propose Florenz, a monolingual encoder-decoder VLM with 0.4B to 11.2B parameters combining the pre-trained VLM Florence-2 and the large language model Gemma-2. Florenz is trained with varying compute budgets on a synthetic dataset that features intentionally incomplete language coverage for image captioning, thus, testing generalization from the fully covered translation task. We show that not only does indirectly learning unseen task-language pairs adhere to a scaling law, but also that with our data generation pipeline and the proposed Florenz model family, image captioning abilities can emerge in a specific language even when only data for the translation task is available. Fine-tuning on a mix of downstream datasets yields competitive performance and demonstrates promising scaling trends in multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).
