WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation

Abstract

A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We introduce a novel paradigm for dataset creation based on human and machine collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI, our approach uses dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instructs GPT-3 to compose new examples with similar patterns. Machine generated examples are then automatically filtered, and finally revised and labeled by human crowdworkers to ensure quality. The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths over existing NLI datasets. Remarkably, training a model on WANLI instead of MNLI (which is 4 times larger) improves performance on seven out-of-domain test sets we consider, including by 11% on HANS and 9% on Adversarial NLI. Moreover, combining MNLI with WANLI is more effective than combining with other augmentation sets that have been introduced. Our results demonstrate the potential of natural language generation techniques to curate NLP datasets of enhanced quality and diversity.
