Abstract
The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost the synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.