OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

Abstract

Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models. Model checkpoints will be released on https://huggingface.co/collections/espnet/owls-scaling-laws-for-speech-recognition-and-translation-67ab7f991c194065f057ce8d for future studies.
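
As a rough illustration of what predicting performance from a scaling law can look like, the sketch below fits a power law of the form error ~= a * N^(-b) to a handful of (model size, error) pairs and extrapolates to a larger model. The functional form, the helper names `fit_power_law` and `predict`, the intermediate model sizes, and all numeric values are assumptions made for illustration; the data is synthetic and is not taken from OWLS.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error ~= a * size**(-b) by ordinary least squares in log-log space.

    Hypothetical helper for illustration; returns the pair (a, b).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(error) = intercept + slope * log(size), so a = exp(intercept), b = -slope
    return math.exp(intercept), -slope


def predict(a, b, size):
    """Evaluate the fitted power law at a given model size."""
    return a * size ** (-b)


# Synthetic data generated from a known power law (NOT results from the paper).
sizes = [0.25e9, 0.5e9, 1e9, 2e9, 4e9]              # parameter counts (assumed)
errors = [30.0 * (s / 1e9) ** -0.2 for s in sizes]  # made-up error rates

a, b = fit_power_law(sizes, errors)
predicted_18b = predict(a, b, 18e9)  # extrapolate to an 18B-parameter model
print(f"fitted exponent b = {b:.3f}, predicted error at 18B = {predicted_18b:.2f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating exponent; real measurements would be noisy, and the fit would be only as reliable as the chosen functional form.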
