Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

  • 2024-02-09 18:51:49
  • Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, Sara Hooker

Abstract

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.

 
