From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

Abstract

Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
