AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

  • 2022-08-02 14:30:07
  • Saleh Soltan, Shankar Ananthakrishnan, Jack FitzGerald, Rahul Gupta, Wael Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andy Rosenbaum, Anna Rumshisky, Chandana Satya Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv Verma, Gokhan Tur, Prem Natarajan
  • 16

Abstract

In this work, we demonstrate that multilingual large-scalesequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoisingand Causal Language Modeling (CLM) tasks, are more efficient few-shot learnersthan decoder-only models on various tasks. In particular, we train a 20 billionparameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B)and show that it achieves state-of-the-art (SOTA) performance on 1-shotsummarization tasks, outperforming a much larger 540B PaLM decoder model.AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially forlow-resource languages, across almost all language pairs supported by the model(Arabic, English, French, German, Hindi, Italian, Japanese, Marathi,Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show inzero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2datasets and provides SOTA performance on multilingual tasks such as XNLI,XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling casefor seq2seq models as a powerful alternative to decoder-only models forLarge-scale Language Model (LLM) training.

 

Quick Read (beta)

loading the full paper ...