Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

Abstract

While natural language processing tools have been developed extensively for some of the world's languages, a significant portion of the world's over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset by first developing applicable topics and employing a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.
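
The label projection step described above can be illustrated with a minimal sketch. It assumes that each verse carries an identifier shared across translations and that English annotations are stored as a mapping from that identifier to a topic label. The function name `project_labels`, the verse IDs, and the placeholder labels are hypothetical and are not taken from the released code.

```python
from typing import Dict, List, Tuple


def project_labels(
    english_labels: Dict[str, str],
    target_verses: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Pair each target-language verse with the label of its aligned English verse.

    english_labels: verse ID -> topic label annotated on the English side.
    target_verses:  verse ID -> verse text in the target language.
    Verses without an English annotation are skipped.
    """
    return [
        (text, english_labels[verse_id])
        for verse_id, text in target_verses.items()
        if verse_id in english_labels
    ]


# Hypothetical verse IDs and placeholder labels, for illustration only.
english_labels = {"v001": "topic_a", "v002": "topic_b"}
target_verses = {
    "v001": "<target-language text of v001>",
    "v002": "<target-language text of v002>",
    "v003": "<target-language text of v003>",  # no English annotation, so it is dropped
}

projected = project_labels(english_labels, target_verses)
```

Because the alignment is by verse ID alone, the same English annotations can be reused for every translation that shares the verse numbering, which is what makes the approach scale to a large number of languages.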
