Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset

Abstract

Curating datasets that span multiple languages is challenging. To make the collection more scalable, researchers often incorporate one or more imperfect classifiers in the process, like language identification models. These models, however, are prone to failure, resulting in some language subsets being unreliable for downstream tasks. We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable subsets. By annotating only 20 samples for a language subset, we're able to identify systematic transcription errors for 10 language subsets in a recent large multilingual transcribed audio dataset, X-IPAPack (Zhu et al., 2024). We find that filtering this low-quality data out when training models for the downstream task of phonetic transcription brings substantial benefits, most notably a 25.7% relative improvement on transcribing recordings in out-of-distribution languages. Our method lays a path forward for systematic and reliable multilingual dataset auditing.
