Abstract
Multi-task benchmarks such as GLUE and SuperGLUE have driven great progress in pretraining and transfer learning in Natural Language Processing (NLP). These benchmarks mostly focus on a range of Natural Language Understanding (NLU) tasks, without considering Natural Language Generation (NLG) models. In this paper, we present the General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks. For each task, we further design three subtasks in terms of task difficulty (GLGE-Easy, GLGE-Medium, and GLGE-Hard). This introduces 24 subtasks to comprehensively compare model performance. To encourage research on pretraining and transfer learning on NLG models, we make GLGE publicly available and build a leaderboard with strong baselines including MASS, BART, and ProphetNet. The source code and dataset are publicly available at https://github.com/microsoft/glge.