Abstract
Speech language models are language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm for training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal for learning prosody information: we find that the resulting LLMs do not exhibit obvious emergent prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed scheme retains more complete prosody information and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emergent prosody processing capabilities through pre-training alone, ranging from harnessing prosodic nuances in generated speech, such as contrastive focus, to understanding emotion and stress in an utterance and maintaining prosody consistency in long contexts.
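
To make the "text first, then word-level prosody tokens" layout concrete, the following is a minimal illustrative sketch, not the ProsodyLM implementation. The abstract does not specify which prosodic features are used or how they are discretized, so the feature names (`pitch_hz`, `duration_s`), the value ranges, the bin count, the `<prosody>` separator, and the token spellings are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WordProsody:
    word: str
    pitch_hz: float    # hypothetical feature: mean pitch over the word
    duration_s: float  # hypothetical feature: spoken duration of the word


def quantize(value, low, high, n_bins):
    """Map a value to an integer bin in [0, n_bins - 1], clipping to [low, high]."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)


def tokenize_utterance(words, n_bins=8):
    """Return the transcript words followed by per-word prosody tokens.

    The ranges (50-400 Hz, 0-1 s) and the "<prosody>" separator are
    assumptions for this sketch, not values taken from the paper.
    """
    text_part = [w.word for w in words]
    prosody_part = []
    for w in words:
        p = quantize(w.pitch_hz, 50.0, 400.0, n_bins)
        d = quantize(w.duration_s, 0.0, 1.0, n_bins)
        prosody_part.append(f"<pitch_{p}>")
        prosody_part.append(f"<dur_{d}>")
    return text_part + ["<prosody>"] + prosody_part
```

Calling `tokenize_utterance` on a list of `WordProsody` items yields one flat sequence in which the plain-text transcript comes first and each word then contributes a fixed number of prosody tokens, in word order. Keeping the transcript as ordinary text is what would let a text-based LLM read the content directly while the prosody tokens carry the remaining information.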