SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

Abstract

Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts the Encoder-Decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
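
As a rough illustration of the residual vector quantization (RVQ) step mentioned above, the following is a minimal NumPy sketch of how stacked quantizer layers each encode what the previous layers left over. It is not the authors' implementation: the function names `rvq_encode` and `rvq_decode`, the random (untrained) codebooks, and all sizes are assumptions chosen for illustration only, and the sketch does not model the encoder, the decoder, or how SpeechTokenizer assigns different speech information to different layers.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize vectors x of shape (n, d) with a list of codebooks of shape (k, d).

    Returns (indices, residual): indices has shape (n_layers, n), and residual
    is what remains of x after subtracting every selected code vector.
    """
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # Squared Euclidean distance from each residual to each code vector.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indices.append(idx)
        # The next layer only sees what this layer failed to represent.
        residual = residual - cb[idx]
    return np.stack(indices), residual


def rvq_decode(indices, codebooks):
    """Reconstruct vectors by summing the selected code vector of each layer."""
    return sum(cb[idx] for idx, cb in zip(indices, codebooks))


# Hypothetical sizes: 5 frames of dimension 8, 4 RVQ layers, 16 codes per layer.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))
# Random codebooks stand in for learned ones (assumption for illustration).
codebooks = [rng.normal(size=(16, 8)) * 0.5 ** i for i in range(4)]

tokens, residual = rvq_encode(frames, codebooks)
reconstruction = rvq_decode(tokens, codebooks)
```

In this sketch each frame becomes one token per layer, so `tokens[0]` holds the first-layer tokens for all frames and later rows hold progressively finer corrections. Decoding with only the first few rows, for example `rvq_decode(tokens[:2], codebooks[:2])`, gives a coarser reconstruction built from those layers alone.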
