RAGRank: Using PageRank to Counter Poisoning in CTI LLM Pipelines

Abstract

Retrieval-Augmented Generation (RAG) has emerged as the dominant architectural pattern for operationalizing Large Language Model (LLM) usage in Cyber Threat Intelligence (CTI) systems. However, this design is susceptible to poisoning attacks, and previously proposed defenses can fail in CTI contexts because information about emerging attacks is often entirely new, and sophisticated threat actors can mimic legitimate formats, terminology, and stylistic conventions. To address this issue, we propose that the robustness of modern RAG defenses can be improved by applying source credibility algorithms to corpora, using PageRank as an example. In our experiments on the standardized MS MARCO dataset, we demonstrate quantitatively that our algorithm assigns a lower authority score to malicious documents while promoting trusted content. We also demonstrate proof-of-concept performance of our algorithm on CTI documents and feeds.
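
To illustrate the general idea of a PageRank-style authority score, the following is a minimal sketch of textbook power-iteration PageRank over a small link graph. It is not the paper's implementation: the abstract does not specify how the document graph is built, so the toy graph, the document names, the `pagerank` function, and the damping value of 0.85 are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100, tol=1e-10):
    """Textbook power-iteration PageRank.

    `links` maps each document to the list of documents it references.
    Returns a dict of scores that sum to 1.
    """
    # Collect every document that appears as a source or a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by documents with no outgoing links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        # Each document passes an equal share of its score to what it references.
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy corpus (assumption): two established sources reference each
# other, an analyst post references both, and an injected document references a
# trusted source but is referenced by nothing.
links = {
    "vendor_advisory": ["cert_bulletin"],
    "cert_bulletin": ["vendor_advisory"],
    "analyst_blog": ["vendor_advisory", "cert_bulletin"],
    "poisoned_doc": ["vendor_advisory"],
}
scores = pagerank(links)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.4f}")
```

In this toy graph the injected document receives only the baseline score because nothing links to it, while the mutually referenced sources accumulate authority. A retrieval pipeline could, under these assumptions, use such scores to down-weight low-authority documents before they reach the LLM.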
