Abstract
Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework that disentangles speaker identity and linguistic content from speech signals in a self-supervised manner. GenVC leverages speech tokenizers and an autoregressive, Transformer-based language model as its backbone for speech generation. This design supports large-scale training while enhancing both source speaker privacy protection and target speaker cloning fidelity. Experimental results demonstrate that GenVC achieves notably higher speaker similarity, with naturalness on par with leading zero-shot approaches. Moreover, due to its autoregressive formulation, GenVC introduces flexibility in temporal alignment, reducing the preservation of source prosody and speaker-specific traits, and making it highly effective for voice anonymization.