What Catches the Eye? Visualizing and Understanding Deep Saliency Models

Abstract

Deep convolutional neural networks have demonstrated high performance for fixation prediction in recent years. How they achieve this, however, is less explored, and they remain black-box models. Here, we attempt to shed light on the internal structure of deep saliency models and study what features they extract for fixation prediction. Specifically, we use a simple yet powerful architecture, consisting of only one CNN and a single-resolution input, combined with a new loss function for pixel-wise fixation prediction during free viewing of natural scenes. We show that our simple method is on par with or better than more complicated state-of-the-art saliency models. Furthermore, we propose a method, related to saliency model evaluation metrics, to visualize deep models for fixation prediction. Our method reveals the inner representations of deep models for fixation prediction and provides evidence that saliency, as experienced by humans, is likely to involve high-level semantic knowledge in addition to low-level perceptual cues. Our results can be useful to measure the gap between current saliency models and the human inter-observer model and to build new models to close this gap.
