A Variational Framework for Improving Naturalness in Generative Spoken Language Models

Abstract

The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at https://github.com/b04901014/vae-gslm.
