Abstract
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however, in the case of video data, there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.