Abstract
We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Built on a masked generative model, AnyEnhance supports a wide range of enhancement tasks, including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. This mechanism can boost enhancement performance when reference audio is available and enables the target speaker extraction task without altering the underlying architecture. Moreover, we introduce a self-critic mechanism into the generative process of masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate that AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audio samples are publicly available at https://amphionspace.github.io/anyenhance. An open-source implementation is provided at https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc.