Abstract
Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates a raw waveform conditioned on a mel spectrogram, it is still challenging to synthesize high-fidelity audio for numerous speakers across varied recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well to various unseen conditions in a zero-shot setting. We introduce periodic nonlinearities and an anti-aliased representation into the generator, which bring the desired inductive bias for waveform synthesis and significantly improve audio quality. Based on our improved generator and state-of-the-art discriminators, we train our GAN vocoder at scales of up to 112M parameters, which is unprecedented in the literature. In particular, we identify and address the training instabilities specific to such a scale, while maintaining high-fidelity output without over-regularization. BigVGAN achieves state-of-the-art zero-shot performance in various out-of-distribution scenarios, including new speakers, novel languages, singing voices, music, and instrumental audio in unseen (even noisy) recording environments. We will release our code and model at: https://github.com/NVIDIA/BigVGAN
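To make the idea of a periodic nonlinearity concrete, the following is a minimal sketch assuming a Snake-style activation of the form x + (1/alpha) * sin^2(alpha * x). The abstract does not state the functional form, so the function name `snake`, the parameter `alpha`, and its default value are illustrative assumptions rather than the released implementation. The anti-aliasing step (low-pass filtering around the nonlinearity) is not shown.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake-style periodic activation: x + (1/alpha) * sin^2(alpha * x).

    `alpha` (assumed positive here) sets the frequency of the periodic
    component; in a trained model it would typically be learned.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + (math.sin(alpha * x) ** 2) / alpha


def snake_all(xs, alpha: float = 1.0):
    """Apply the activation elementwise to a sequence of samples."""
    return [snake(x, alpha) for x in xs]
```

The linear term keeps the function monotonic (its derivative is 1 + sin(2 * alpha * x), which is never negative), while the squared-sine term adds a component with period pi / alpha, which is the kind of inductive bias toward periodic signals the abstract refers to.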