Abstract
The rapid advancement of Large Language Models (LLMs), particularly those trained on multilingual corpora, has intensified the need for a deeper understanding of their performance across a diverse range of languages and model sizes. Our research addresses this need by studying the performance and scaling behavior of multilingual LLMs in text classification and machine translation tasks across 204 languages. We systematically examine both seen and unseen languages across three model families of varying sizes in zero-shot and few-shot settings. Our findings show significant differences in scaling behavior between zero-shot and two-shot scenarios, with striking disparities in performance between seen and unseen languages. Model scale has little effect on zero-shot performance, which remains mostly flat. In two-shot settings, however, larger models show clear linear improvements in multilingual text classification. For translation tasks, only the instruction-tuned model shows clear benefits from scaling. Our analysis also suggests that overall resource levels, not just the proportions of pretraining languages, are better predictors of model performance, shedding light on what drives multilingual LLM effectiveness.