Robustness Gym: Unifying the NLP Evaluation Landscape

Abstract

Despite impressive performance on standard benchmarks, deep neural networks are often brittle when deployed in real-world systems. Consequently, recent research has focused on testing the robustness of such models, resulting in a diverse set of evaluation methodologies ranging from adversarial attacks to rule-based data transformations. In this work, we identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. By providing a common platform for evaluation, Robustness Gym enables practitioners to compare results from all 4 evaluation paradigms with just a few clicks, and to easily develop and share novel evaluation methods using a built-in set of abstractions. To validate Robustness Gym's utility to practitioners, we conducted a real-world case study with a sentiment-modeling team, revealing performance degradations of 18%+. To verify that Robustness Gym can aid novel research analyses, we perform the first study of state-of-the-art commercial and academic named entity linking (NEL) systems, as well as a fine-grained analysis of state-of-the-art summarization models. For NEL, commercial systems struggle to link rare entities and lag their academic counterparts by 10%+, while state-of-the-art summarization models struggle on examples that require abstraction and distillation, degrading by 9%+. Robustness Gym can be found at https://robustnessgym.com/
