Abstract
Speech data has rich acoustic and paralinguistic information with important cues for understanding a speaker's tone, emotion, and intent, yet traditional large language models such as BERT do not incorporate this information. There has been an increased interest in multi-modal language models leveraging audio and/or visual information and text. However, current multi-modal language models require both text and audio/visual data streams during inference/test time. In this work, we propose a methodology for training language models leveraging spoken language audio data but without requiring the audio stream during prediction time. This leads to an improved language model for analyzing spoken transcripts while avoiding an audio processing overhead at test time. We achieve this via an audio-language knowledge distillation framework, where we transfer acoustic and paralinguistic information from a pre-trained speech embedding (OpenAI Whisper) teacher model to help train a student language model on an audio-text dataset. In our experiments, the student model achieves consistent improvement over traditional language models on tasks analyzing spoken transcripts.
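The distillation idea can be illustrated with a minimal sketch. The abstract does not specify the training objective, so everything here is an assumption made for illustration: the cosine-distance form of the distillation term, the weighting parameter `alpha`, the function names, and the premise that the student's text embedding has already been projected to the same dimension as the teacher's audio embedding.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distillation_loss(student_emb, teacher_emb):
    """Hypothetical distillation term: 1 - cosine similarity.

    It is 0 when the student's text embedding points in the same
    direction as the teacher's audio embedding, and grows as they diverge.
    """
    return 1.0 - cosine_similarity(student_emb, teacher_emb)


def combined_loss(task_loss, student_emb, teacher_emb, alpha=0.5):
    """Assumed training objective: task loss plus a weighted distillation term.

    `alpha` is an illustrative hyperparameter, not a value from the paper.
    """
    return task_loss + alpha * distillation_loss(student_emb, teacher_emb)


# Toy vectors standing in for a projected text embedding (student)
# and a speech embedding (teacher) of the same utterance.
student = [1.0, 0.0]
teacher = [0.0, 1.0]
loss = combined_loss(task_loss=2.0, student_emb=student, teacher_emb=teacher)
```

In a setup like this, the teacher embedding appears only in the training loss. At prediction time only the student is run on the transcript, which is how the audio processing overhead would be avoided.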