KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using Mel-spectrograms

Abstract

In this paper, we propose a novel neural network model called KaraSinger for a less-studied singing voice synthesis (SVS) task named score-free SVS, in which the prosody and melody are spontaneously decided by machine. KaraSinger comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses the Mel-spectrograms of singing audio to sequences of discrete codes, and a language model (LM) that learns to predict the discrete codes given the corresponding lyrics. For the VQ-VAE part, we employ a Connectionist Temporal Classification (CTC) loss to encourage the discrete codes to carry phoneme-related information. For the LM part, we use location-sensitive attention for learning a robust alignment between the input phoneme sequence and the output discrete code. We keep the architecture of both the VQ-VAE and LM light-weight for fast training and inference speed. We validate the effectiveness of the proposed design choices using a proprietary collection of 550 English pop songs sung by multiple amateur singers. The result of a listening test shows that KaraSinger achieves high scores in intelligibility, musicality, and the overall quality.
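
As an illustration of the quantization step mentioned in the abstract, the following minimal sketch shows how a generic VQ-VAE bottleneck could map continuous encoder frames to discrete codes by nearest-neighbor lookup in a codebook. It is not the KaraSinger implementation: the function name `quantize`, the toy codebook, and the frame values are all hypothetical, and the encoder, decoder, CTC loss, and language model are omitted.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each frame, the index of the nearest codebook vector.

    Hypothetical helper: squared Euclidean distance is assumed as the
    nearest-neighbor criterion, as in a standard VQ-VAE.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Toy data (assumed values): three 2-dimensional encoder output frames
# and a codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

codes = quantize(frames, codebook)
print(codes)  # one discrete code index per frame
```

In a setup like the one the abstract describes, a sequence of such indices would be what the language model learns to predict from the lyrics.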
