Scaling End-to-End Models for Large-Scale Multilingual ASR

Abstract

Building ASR models across many language families is a challenging multi-task learning problem due to large language variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity. We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.7K to 54.7K hours. We adopt GShard [1] to efficiently scale up to 10B parameters. Empirically, we find that (1) scaling the number of model parameters is an effective way to solve the capacity bottleneck - our 500M-param model is already better than monolingual baselines and scaling it to 1B and 10B brought further quality gains; (2) larger models are not only more data efficient, but also more efficient in terms of training cost as measured in TPU days - the 1B-param model reaches the same accuracy at 34% of training time as the 500M-param model; (3) given a fixed capacity budget, adding depth usually works better than width and large encoders tend to do better than large decoders.
