Abstract
While recent multilingual automatic speech recognition (ASR) models claim to support thousands of languages, ASR for low-resource languages remains highly unreliable due to limited bimodal speech and text training data. Better multilingual spoken language understanding (SLU) can substantially strengthen the robustness of multilingual ASR by leveraging language semantics to compensate for scarce training data, for example by disambiguating utterances via context or exploiting semantic similarities across languages. Moreover, SLU is indispensable for inclusive speech technology in the roughly half of all living languages that lack a formal writing system. However, the evaluation of multilingual SLU remains limited to shallower tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses topical speech classification in 102 languages and multiple-choice question answering through listening comprehension in 92 languages. We extensively evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with subsequent classification by large language models on Fleurs-SLU. Our results show that cascaded systems exhibit greater robustness in multilingual SLU tasks, though speech encoders can achieve competitive performance in topical speech classification when appropriately pre-trained. We further find a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.