Abstract
Zero-Shot Anomaly Detection (ZSAD) seeks to identify anomalies from arbitrary novel categories, offering a scalable and annotation-efficient solution. Traditionally, most ZSAD works have been based on the CLIP model, which performs anomaly detection by calculating the similarity between visual and text embeddings. Recently, vision foundation models such as DINOv3 have demonstrated strong transferable representation capabilities. In this work, we are the first to adapt DINOv3 for ZSAD. However, this adaptation presents two key challenges: (i) the domain bias between large-scale pretraining data and anomaly detection tasks leads to feature misalignment; and (ii) the inherent bias toward global semantics in pretrained representations often leads to subtle anomalies being misinterpreted as part of the normal foreground objects, rather than being distinguished as abnormal regions. To overcome these challenges, we introduce AD-DINOv3, a novel vision-language multimodal framework designed for ZSAD. Specifically, we formulate anomaly detection as a multimodal contrastive learning problem, where DINOv3 is employed as the visual backbone to extract patch tokens and a CLS token, and the CLIP text encoder provides embeddings for both normal and abnormal prompts. To bridge the domain gap, lightweight adapters are introduced in both modalities, enabling their representations to be recalibrated for the anomaly detection task. Beyond this baseline alignment, we further design an Anomaly-Aware Calibration Module (AACM), which explicitly guides the CLS token to attend to anomalous regions rather than generic foreground semantics, thereby enhancing discriminability. Extensive experiments on eight industrial and medical benchmarks demonstrate that AD-DINOv3 consistently matches or surpasses state-of-the-art methods. The code will be available at https://github.com/Kaisor-Yuan/AD-DINOv3.
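
To make the similarity-based formulation concrete, the following minimal sketch scores each visual token against a normal and an abnormal text embedding. It is an illustration under stated assumptions, not the AD-DINOv3 implementation: the function names, the two-way softmax, and the temperature of 0.07 are hypothetical choices borrowed from common CLIP-style scoring, the vectors stand in for already-adapted DINOv3 tokens and CLIP text embeddings, and the adapters and AACM are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def abnormal_probability(token, text_normal, text_abnormal, temperature=0.07):
    """Two-way softmax over similarities to the normal and abnormal text embeddings.

    The temperature is an assumed value, not one taken from the paper.
    """
    logit_n = cosine(token, text_normal) / temperature
    logit_a = cosine(token, text_abnormal) / temperature
    m = max(logit_n, logit_a)  # subtract the max for numerical stability
    exp_n = math.exp(logit_n - m)
    exp_a = math.exp(logit_a - m)
    return exp_a / (exp_n + exp_a)


def anomaly_scores(patch_tokens, cls_token, text_normal, text_abnormal):
    """Return a per-patch anomaly map and an image-level score from the CLS token."""
    patch_map = [
        abnormal_probability(p, text_normal, text_abnormal) for p in patch_tokens
    ]
    image_score = abnormal_probability(cls_token, text_normal, text_abnormal)
    return patch_map, image_score


# Toy 2-D embeddings chosen by hand for illustration only.
text_normal = [1.0, 0.0]
text_abnormal = [0.0, 1.0]
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cls_token = [1.0, 0.0]

patch_map, image_score = anomaly_scores(
    patch_tokens, cls_token, text_normal, text_abnormal
)
```

In this toy setting, a patch aligned with the abnormal prompt receives a probability above 0.5, a patch aligned with the normal prompt falls below 0.5, and a patch equidistant from both sits at 0.5. In practice the flat patch map would be reshaped to the patch grid and upsampled to image resolution for localization.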