Abstract
In this paper, we propose a novel neural network model called KaraSinger for a less-studied singing voice synthesis (SVS) task named score-free SVS, in which the prosody and melody are spontaneously decided by the machine. KaraSinger comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses the Mel-spectrograms of singing audio into sequences of discrete codes, and a language model (LM) that learns to predict the discrete codes given the corresponding lyrics. For the VQ-VAE part, we employ a Connectionist Temporal Classification (CTC) loss to encourage the discrete codes to carry phoneme-related information. For the LM part, we use location-sensitive attention to learn a robust alignment between the input phoneme sequence and the output discrete codes. We keep the architectures of both the VQ-VAE and the LM lightweight for fast training and inference. We validate the effectiveness of the proposed design choices using a proprietary collection of 550 English pop songs sung by multiple amateur singers. The results of a listening test show that KaraSinger achieves high scores in intelligibility, musicality, and overall quality.