GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

  • 2022-06-22 18:52:30
  • Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou, Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh Dhole, Khyathi Raghavi Chandu, Leonardo F. R. Ribeiro, Lewis Tunstall, Li Zhang, Mahima Pushkarna, Mathias Creutz, Michael White, Mihir Sanjay Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani, Ondrej Dusek, Paul

Abstract

Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.

 
