Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Abstract

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.
