Vision Transformers Don't Need Trained Registers

Abstract

We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers -- the emergence of high-norm tokens that lead to noisy attention maps. We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models to improve their interpretability. Our results suggest that test-time registers effectively take on the role of register tokens at test-time, offering a training-free solution for any pre-trained model released without them.
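The core idea of shifting register-neuron activations into an extra token can be illustrated with a minimal sketch. It is a toy illustration under stated assumptions, not the paper's implementation: activations are a plain list of per-token rows, the last row is the appended untrained token, and the shifting rule (move each register neuron's peak value to the extra token and zero it on the other tokens) is an assumed simplification. The function name `shift_to_register` and the example values are hypothetical.

```python
def shift_to_register(acts, register_neurons):
    """Toy sketch: move register-neuron activations onto an extra token.

    acts: list of rows, one per token, each a list of neuron activations.
          The LAST row is assumed to be the appended, untrained token.
    register_neurons: indices of the neurons assumed to drive the outliers.
    Returns a new list of rows; the input is not modified.
    """
    out = [row[:] for row in acts]      # copy so the caller's data is untouched
    reg = len(out) - 1                  # index of the extra token
    for n in register_neurons:
        # Assumed rule: take the peak activation over the original tokens...
        peak = max(out[t][n] for t in range(reg))
        # ...zero that neuron on every original token...
        for t in range(reg):
            out[t][n] = 0.0
        # ...and place the peak on the extra token instead.
        out[reg][n] = peak
    return out


# Hypothetical example: 4 patch tokens + 1 extra token, 3 neurons.
# Neuron 1 fires with a very large value on patch token 2 (the outlier).
acts = [
    [0.1, 0.2, 0.3],
    [0.4, 0.1, 0.2],
    [0.3, 50.0, 0.1],
    [0.2, 0.3, 0.4],
    [0.0, 0.0, 0.0],   # extra untrained token
]
shifted = shift_to_register(acts, register_neurons=[1])
```

In this sketch only the selected neuron's column changes; all other activations pass through untouched. In a real model, an edit of this kind would be applied inside the forward pass at the layers where the register neurons sit.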
