Abstract
The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. We encode speech at different frame rates and evaluate the resulting semantic tokens in the speech recognition task. Our findings reveal that frame rate variations influence speech tokenization differently for each language, highlighting the interplay between frame rates, phonetic density, and language-specific acoustic features. The results provide insights into optimizing frame rate selection for speech tokenizers, with implications for automatic speech recognition, text-to-speech, and other speech-related applications.