Abstract
Text-to-Speech (TTS) models can generate natural, human-like speech across multiple languages by transforming phonemes into waveforms. However, multilingual TTS remains challenging due to discrepancies in phoneme vocabularies and variations in prosody and speaking style across languages. Existing approaches either train separate models for each language, which achieve high performance at the cost of increased computational resources, or use a unified model for multiple languages that struggles to capture fine-grained, language-specific style variations. In this work, we propose LanStyleTTS, a non-autoregressive, language-aware style adaptive TTS framework that standardizes phoneme representations and enables fine-grained, phoneme-level style control across languages. This design supports a unified multilingual TTS model capable of producing accurate and high-quality speech without the need to train language-specific models. We evaluate LanStyleTTS by integrating it with several state-of-the-art non-autoregressive TTS architectures. Results show consistent performance improvements across different model backbones. Furthermore, we investigate a range of acoustic feature representations, including mel-spectrograms and autoencoder-derived latent features. Our experiments demonstrate that latent encodings can significantly reduce model size and computational cost while preserving high-quality speech generation.