Unsupervised Any-to-Many Audiovisual Synthesis via Exemplar Autoencoders

  • 2020-01-13 18:56:45
  • Kangle Deng, Aayush Bansal, Deva Ramanan
  • 4


We present an unsupervised approach that converts the speech input of any one individual to an output set of potentially infinitely many speakers. One can stand in front of a mic and make their favorite celebrity say the same words. Our approach builds on simple autoencoders that project out-of-sample data to the distribution of the training set (motivated by PCA/linear autoencoders). We use an exemplar autoencoder to learn the voice and specific style (emotions and ambiance) of a target speaker. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers in very little time, using only two to three minutes of audio data from a speaker. We also exhibit the usefulness of our approach for generating video from audio signals and vice versa. We encourage the reader to visit our project webpage for various synthesized examples: https://dunbar12138.github.io/projectpage/Audiovisual/
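
The PCA/linear-autoencoder motivation mentioned above can be illustrated with a minimal sketch. It is a toy linear analogy only, not the exemplar autoencoder itself, which the abstract does not specify in detail. The function names (fit_linear_autoencoder, reconstruct, distance_to_subspace) and the 3-D toy data are hypothetical, introduced purely for illustration. The point shown: a linear autoencoder fit to one "speaker's" data maps any out-of-sample input onto the subspace spanned by that training data.

```python
import numpy as np

def fit_linear_autoencoder(train, k):
    """Fit a rank-k PCA model. Returns (mean, components), components shaped (k, d)."""
    mean = train.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruct(x, mean, components):
    """Encode x into k coefficients, then decode back to input space."""
    code = (x - mean) @ components.T
    return mean + code @ components

def distance_to_subspace(x, mean, components):
    """Distance from x to the affine subspace learned from the training set."""
    r = x - mean
    return np.linalg.norm(r - (r @ components.T) @ components)

rng = np.random.default_rng(0)
# Toy stand-in for a target speaker's features (assumption): points on a line in 3-D.
direction = np.array([1.0, 2.0, 2.0]) / 3.0
offset = np.array([5.0, 0.0, -1.0])
train = rng.normal(size=(200, 1)) * direction + offset

mean, components = fit_linear_autoencoder(train, k=1)

# An out-of-sample input that does not lie on the training line.
x_new = np.array([0.0, 4.0, 3.0])
x_hat = reconstruct(x_new, mean, components)

print("distance before:", distance_to_subspace(x_new, mean, components))
print("distance after: ", distance_to_subspace(x_hat, mean, components))
```

In this sketch the reconstruction of x_new lands on the training line, which is the linear counterpart of the claim that the autoencoder projects arbitrary input onto the training distribution of the target speaker.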


