DeepSinger: Singing Voice Synthesis with Data Mined From the Web

Abstract

In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model to automatically extract the duration of each phoneme in lyrics, starting from the coarse-grained sentence level and moving to the fine-grained phoneme level. We further design a multi-lingual multi-singer singing model based on a feed-forward Transformer to directly generate linear-spectrograms from lyrics, and synthesize voices using Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model avoids any human effort for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, removing the complicated acoustic feature modeling used in parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese and English). The results demonstrate that with singing data mined purely from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness (footnote: our audio samples are available at https://speechresearch.github.io/deepsinger/).
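
To make the idea of phoneme-level duration extraction concrete, the following is a minimal sketch. It is not the alignment model described in the paper. It assumes that some alignment step has already assigned each audio frame to a phoneme index in a monotonic order, and it only shows how per-phoneme durations could then be read off. The function names, the input format, and the `hop_seconds` value are hypothetical choices made for illustration.

```python
def phoneme_frame_counts(frame_to_phoneme, num_phonemes):
    """Count how many frames are assigned to each phoneme.

    frame_to_phoneme: list of phoneme indices, one per audio frame
                      (hypothetical input format, assumed monotonic).
    num_phonemes:     number of phonemes in the lyrics sentence.
    """
    counts = [0] * num_phonemes
    previous = 0
    for index in frame_to_phoneme:
        if not 0 <= index < num_phonemes:
            raise ValueError("phoneme index out of range")
        if index < previous:
            # Singing follows the lyrics in order, so the alignment
            # is assumed never to move backwards.
            raise ValueError("alignment must be monotonic")
        counts[index] += 1
        previous = index
    return counts


def counts_to_seconds(counts, hop_seconds=0.0125):
    """Convert frame counts to seconds; hop_seconds is an assumed frame shift."""
    return [count * hop_seconds for count in counts]


# Six frames aligned to three phonemes.
frames = [0, 0, 1, 2, 2, 2]
counts = phoneme_frame_counts(frames, 3)
print(counts)
print(counts_to_seconds(counts))
```

Durations expressed in frames are what a feed-forward synthesis model typically needs in order to expand phoneme-level inputs to frame-level outputs. The conversion to seconds is included only for readability.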
