Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding

Abstract

Text anomaly detection is a critical task in natural language processing (NLP), with applications spanning fraud detection, misinformation identification, spam detection, and content moderation. Despite significant advances in large language models (LLMs) and anomaly detection algorithms, the absence of standardized and comprehensive benchmarks for evaluating existing anomaly detection methods on text data limits rigorous comparison and the development of innovative approaches. This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection, leveraging embeddings from diverse pre-trained language models across a wide array of text datasets. Our work systematically evaluates the effectiveness of embedding-based text anomaly detection by incorporating (1) early language models (GloVe, BERT); (2) multiple LLMs (LLaMA-2, LLaMA-3, Mistral, OpenAI (small, ada, large)); (3) multi-domain text datasets (news, social media, scientific publications); (4) comprehensive evaluation metrics (AUROC, AUPRC). Our experiments reveal a critical empirical insight: embedding quality significantly governs anomaly detection efficacy, and deep learning-based approaches demonstrate no performance advantage over conventional shallow algorithms (e.g., KNN, Isolation Forest) when leveraging LLM-derived embeddings. In addition, we observe strong low-rank characteristics in cross-model performance matrices, which enable an efficient strategy for rapid model evaluation (or embedding evaluation) and selection in practical applications. Furthermore, by open-sourcing our benchmark toolkit, which includes all embeddings from the different models and code at https://github.com/jicongfan/Text-Anomaly-Detection-Benchmark, this work provides a foundation for future research in robust and scalable text anomaly detection systems.
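
To make the evaluated pipeline concrete, the following is a minimal sketch of embedding-based anomaly detection with a shallow KNN detector, scored with AUROC and AUPRC. It is not the benchmark's code: the "embeddings" are synthetic Gaussian vectors standing in for language-model embeddings, the function names (knn_scores, auroc, auprc) and all parameters (dimension 16, k = 5, the cluster shift) are illustrative assumptions, and AUPRC is approximated here by average precision.

```python
import math
import random

def knn_scores(train, test, k=5):
    """Anomaly score = mean Euclidean distance to the k nearest training vectors."""
    scores = []
    for x in test:
        dists = sorted(math.dist(x, t) for t in train)
        scores.append(sum(dists[:k]) / k)
    return scores

def auroc(labels, scores):
    """Fraction of (anomaly, normal) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(labels, scores):
    """Average precision: mean of the precision at the rank of each anomaly."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Synthetic stand-in for text embeddings (assumption: real embeddings would
# come from a language model such as those listed in the abstract).
rng = random.Random(0)
DIM = 16

def sample(center, n):
    return [[rng.gauss(center, 1.0) for _ in range(DIM)] for _ in range(n)]

train = sample(0.0, 200)                      # normal-only training set
test = sample(0.0, 50) + sample(3.0, 10)      # normal points plus shifted anomalies
labels = [0] * 50 + [1] * 10                  # 1 marks an anomaly

scores = knn_scores(train, test, k=5)
auroc_value = auroc(labels, scores)
auprc_value = auprc(labels, scores)
print(f"AUROC = {auroc_value:.3f}, AUPRC = {auprc_value:.3f}")
```

In this setup the detector never sees labels, only embeddings of normal data, so swapping the embedding source while keeping the detector fixed isolates the effect of embedding quality on the two metrics.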
