Abstract
Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.