What are the attackers doing now? Automating cyber threat intelligence extraction from text on pace with the changing threat landscape: A survey

Abstract

Cybersecurity researchers have contributed to the automated extraction of cyberthreat intelligence (CTI) from textual sources, such as threat reports and online articles, where cyberattack strategies, procedures, and tools are described. The goal of this article is to help cybersecurity researchers understand the current techniques used for CTI extraction from text through a survey of relevant studies in the literature. We systematically collect studies related to "CTI extraction from text" and categorize their CTI extraction purposes. We propose a CTI extraction pipeline abstracted from these studies. We identify the data sources, techniques, and CTI sharing formats utilized in the context of the proposed pipeline. Our work finds ten types of extraction purposes, such as the extraction of indicators of compromise, TTPs (tactics, techniques, and procedures of attack), and cybersecurity keywords. We also identify seven types of textual sources for CTI extraction; textual data obtained from hacker forums, threat reports, social media posts, and online news articles have been used by almost 90% of the studies. Natural language processing, along with both supervised and unsupervised machine learning techniques such as named entity recognition, topic modelling, dependency parsing, supervised classification, and clustering, is used for CTI extraction. We observe technical challenges in these studies related to obtaining clean, labelled data, the availability of which could assure replication, validation, and further extension of the studies. Because the studies we find focus on extracting CTI information from text, we advocate building upon the current CTI extraction work to help cybersecurity practitioners with proactive decision making, such as threat prioritization and automated threat modelling, that utilizes knowledge from past cybersecurity incidents.
