Abstract
We argue that training autoencoders to reconstruct inputs from noised versions of their encodings, when combined with perceptual losses, yields encodings that are structured according to a perceptual hierarchy. We demonstrate the emergence of this hierarchical structure by showing that, after training an audio autoencoder in this manner, perceptually salient information is captured in coarser representation structures than with conventional training. Furthermore, we show that such perceptual hierarchies improve latent diffusion decoding in the context of estimating surprisal in music pitches and predicting EEG-brain responses to music listening. Pretrained weights are available on github.com/CPJKU/pa-audioic.
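As a rough illustration of the noising step described above, the following sketch perturbs an encoding before it would be passed to a decoder. It is a minimal sketch under stated assumptions, not the authors' implementation: the additive Gaussian noise model, the function name `noise_encoding`, the parameter `noise_std`, and the toy latent vector are all hypothetical, and the encoder, decoder, and perceptual loss are omitted.

```python
import random


def noise_encoding(encoding, noise_std, rng):
    """Return a copy of `encoding` with i.i.d. Gaussian noise added.

    Assumption: additive Gaussian noise; the abstract does not specify
    the noise model actually used.
    """
    return [value + rng.gauss(0.0, noise_std) for value in encoding]


rng = random.Random(0)               # fixed seed for reproducibility
encoding = [0.5, -1.2, 3.0, 0.1]     # hypothetical latent vector
noised = noise_encoding(encoding, noise_std=0.3, rng=rng)

# During training, a decoder would receive `noised` rather than `encoding`,
# and its output would be compared with the original input using a
# perceptual loss (not shown here).
```

With `noise_std=0.0` the encoding passes through unchanged, which corresponds to conventional autoencoder training in this sketch.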