Abstract
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, for reasons ranging from the high temporal resolution of speech tokens, which causes loss of coherence, to architectural issues with long-sequence training or extrapolation, to memory costs at inference time. With these considerations in mind, we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are released at https://google.github.io/tacotron/publications/speechssm/