Abstract
This paper proposes a new voice conversion (VC) task from human speech to dog-like speech while preserving linguistic information, as an example of human to non-human creature voice conversion (H2NH-VC) tasks. Although most VC studies deal with human-to-human VC, H2NH-VC aims to convert human speech into non-human creature-like speech. Non-parallel VC allows us to develop H2NH-VC, because we cannot collect a parallel dataset in which non-human creatures speak human language. In this study, we propose to use dogs as an example of a non-human creature target domain and define the "speak like a dog" task. To clarify the possibilities and characteristics of the "speak like a dog" task, we conducted a comparative experiment on existing representative non-parallel VC methods across acoustic features (Mel-cepstral coefficients and Mel-spectrograms), network architectures (five different kernel-size settings), and training criteria (variational autoencoder (VAE)-based and generative adversarial network-based). Finally, the converted voices were evaluated using mean opinion scores (dog-likeness, sound quality, and intelligibility) and character error rate (CER). The experiment showed that using Mel-spectrograms improved the dog-likeness of the converted speech, although preserving linguistic information remained challenging. Challenges and limitations of the current VC methods for H2NH-VC are highlighted.