Abstract
In this paper, our goal is to convert a set of spoken lines into sung ones. Unlike previous signal-processing-based methods, we take a learning-based approach to the problem. This allows us to automatically model various aspects of this transformation, thus overcoming dependence on specific inputs such as high-quality singing templates or phoneme-score synchronization information. Specifically, we propose an encoder--decoder framework for our task. Given time-frequency representations of speech and a target melody contour, we learn encodings that enable us to synthesize singing that preserves the linguistic content and timbre of the speaker while adhering to the target melody. We also propose a multi-task-learning-based objective to improve lyric intelligibility. We present a quantitative and qualitative analysis of our framework.