When Does Perceptual Alignment Benefit Vision Representations?

Abstract

Humans judge perceptual similarity according to diverse visual attributes, including scene layout, subject location, and camera pose. Existing vision models understand a wide range of semantic abstractions but improperly weigh these attributes and thus make inferences misaligned with human perception. While vision representations have previously benefited from alignment in contexts like image generation, the utility of perceptually aligned representations in more general-purpose settings remains unclear. Here, we investigate how aligning vision model representations to human perceptual judgments impacts their usability across diverse computer vision tasks. We finetune state-of-the-art models on human similarity judgments for image triplets and evaluate them across standard vision benchmarks. We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks, including counting, segmentation, depth estimation, instance retrieval, and retrieval-augmented generation. In addition, we find that performance is widely preserved on other tasks, including specialized out-of-distribution domains such as in medical imaging and 3D environment frames. Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
