Abstract
In this work, we demonstrate that multilingual large-scalesequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoisingand Causal Language Modeling (CLM) tasks, are more efficient few-shot learnersthan decoder-only models on various tasks. In particular, we train a 20 billionparameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B)and show that it achieves state-of-the-art (SOTA) performance on 1-shotsummarization tasks, outperforming a much larger 540B PaLM decoder model.AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially forlow-resource languages, across almost all language pairs supported by the model(Arabic, English, French, German, Hindi, Italian, Japanese, Marathi,Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show inzero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2datasets and provides SOTA performance on multilingual tasks such as XNLI,XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling casefor seq2seq models as a powerful alternative to decoder-only models forLarge-scale Language Model (LLM) training.