Abstract
There has been increasing interest in building multilingual foundation models for NLP and speech research. This paper examines how to expand the speech translation capability of these models with restricted data. Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model. Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space. This shared embedding space can then be leveraged for zero-shot cross-lingual transfer in speech translation. By fine-tuning the Whisper decoder with only English-to-Chinese speech translation data, improved performance for translation to Chinese can be obtained for multiple languages, in addition to English. Furthermore, for languages related to those seen in training, it is possible to perform speech translation despite the model never having seen the language in training or being able to transcribe it.
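
As an illustration of the speech-to-speech retrieval analysis mentioned above, the following is a minimal sketch rather than the procedure used in this paper. It assumes that each utterance is reduced to a single vector by mean-pooling the encoder's frame-level outputs and that utterances are matched by cosine similarity; the abstract specifies neither choice. The function names `mean_pool`, `retrieve`, and `recall_at_1` are hypothetical, and the inputs are taken to be precomputed encoder outputs.

```python
import numpy as np


def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Collapse (num_frames, dim) encoder outputs into one utterance vector.

    Assumption: mean pooling is used here only for illustration.
    """
    return frames.mean(axis=0)


def retrieve(queries: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """For each query vector, return the index of the nearest candidate.

    Both inputs have shape (num_utterances, dim). Similarity is cosine
    similarity, computed as a dot product of L2-normalised vectors.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return (q @ c.T).argmax(axis=1)


def recall_at_1(queries: np.ndarray, candidates: np.ndarray) -> float:
    """Fraction of queries whose nearest candidate is the parallel utterance.

    Assumes row i of `queries` and row i of `candidates` are the same
    sentence spoken in two different languages.
    """
    predicted = retrieve(queries, candidates)
    return float((predicted == np.arange(len(queries))).mean())
```

Under these assumptions, a recall close to 1 between two languages would indicate that parallel utterances lie near each other in the encoder's embedding space, which is the kind of evidence a shared semantic space would produce.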