Datasets: A Community Library for Natural Language Processing

Abstract

The scale, variety, and quantity of publicly-available NLP datasets have grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at https://github.com/huggingface/datasets.
