Abstract
Visual anomaly detection is critical in industrial manufacturing, but traditional methods often rely on extensive normal datasets and custom models, limiting scalability. Recent advancements in large-scale visual-language models have significantly improved zero/few-shot anomaly detection. However, these approaches may not fully utilize hierarchical features, potentially missing nuanced details. We introduce a window self-attention mechanism based on the CLIP model, combined with learnable prompts to process multi-level features within a Soldier-Officer Window self-Attention (SOWA) framework. Our method has been tested on five benchmark datasets, demonstrating superior performance by leading in 18 out of 20 metrics compared to existing state-of-the-art techniques.