Abstract
Large-scale generative language models such as GPT-3 are competitive few-shotlearners. While these models are known to be able to jointly represent manydifferent languages, their training data is dominated by English, potentiallylimiting their cross-lingual generalization. In this work, we trainmultilingual generative language models on a corpus covering a diverse set oflanguages, and study their few- and zero-shot learning capabilities in a widerange of tasks. Our largest model with 7.5 billion parameters sets new state ofthe art in few-shot learning in more than 20 representative languages,outperforming GPT-3 of comparable size in multilingual commonsense reasoning(with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in4-shot settings) and natural language inference (+5.4% in each of 0-shot and4-shot settings). On the FLORES-101 machine translation benchmark, our modeloutperforms GPT-3 on 171 out of 182 directions with 32 training examples, whilesurpassing the official supervised baseline in 45 directions. We conduct anin-depth analysis of different multilingual prompting approaches, showing inparticular that strong few-shot learning performance across languages can beachieved via cross-lingual transfer through both templates and demonstrationexamples. Finally, we evaluate our models in social value tasks such as hatespeech detection in five languages and find it has limitations similar tocomparable sized GPT-3 models.