Abstract
While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions that fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on the ImageNet classification task are remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new Full-Reference Image Quality Assessment (FR-IQA) dataset of human perceptual judgments, orders of magnitude larger than previous datasets. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by large margins. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.