On Designing Machine Learning Models for Malicious Network Traffic Classification

  • 2019-07-10 17:50:34
  • Talha Ongun, Timothy Sakharaov, Simona Boboila, Alina Oprea, Tina Eliassi-Rad

Abstract

Machine learning (ML) started to become widely deployed in cyber security settings for shortening the detection cycle of cyber attacks. To date, most ML-based systems are either proprietary or make specific choices of feature representations and machine learning models. The success of these techniques is difficult to assess as public benchmark datasets are currently unavailable. In this paper, we provide concrete guidelines and recommendations for using supervised ML in cyber security. As a case study, we consider the problem of botnet detection from network traffic data. Among our findings we highlight that: (1) feature representations should take into consideration attack characteristics; (2) ensemble models are well-suited to handle class imbalance; (3) the granularity of ground truth plays an important role in the success of these methods.

 


Talha Ongun, Timothy Sakharaov, Simona Boboila, Alina Oprea, Tina Eliassi-Rad (all Northeastern University)

1 Introduction

A wide spectrum of threats, ranging from opportunistic malicious activities to sophisticated nation-sponsored campaigns, threatens organizations in industry, academia, and government. These attacks usually result in the loss of important information and affect consumers and businesses alike. Notable examples are the Equifax data breach in 2017 and the Anthem healthcare campaign in 2015 that compromised personal financial and medical records for millions of US citizens.

To date, most enterprises deploy many security controls in their environments and apply best practices (such as patching vulnerable systems, use of threat intelligence services, and endpoint scanning) to protect against cyber threats. Monitoring tools are deployed in most organizations either on the network (e.g., network intrusion-detection systems, web proxies, firewalls) or on the end hosts (e.g., anti-virus software, endpoint agents). With the availability of security logs collected by large enterprises, machine learning (ML) has become an important defensive tool in the face of increasingly sophisticated cyber attacks. ML techniques applied to network data include systems for detecting malicious domains (e.g., [1, 5, 2]), methods for detecting malware delivery (e.g., [9]) or command-and-control communication [4, 11, 8, 12], techniques for detecting malicious web pages (e.g., [15]), and various industry products for enterprise threat detection (e.g., [13, 6, 10, 7, 16]).

ML has a lot of potential in shortening the malware detection cycle, but these algorithms tend to come with a number of shortcomings. In particular, Sommer and Paxson [14] highlighted the difficulties of using ML in operational settings for cyber security. The main limitations they identified were: (1) ML excels at supervised tasks by learning from labeled examples, while in cyber security most of the data is unlabeled. (2) ML errors (and in particular false positives) have high cost as alerts need to be investigated by security analysts. (3) Network traffic exhibits high diversity under normal operating conditions. (4) Performing sound evaluations is usually challenging due to unavailability of standard benchmark datasets.

In this paper, we describe some concrete guidelines and recommendations for using supervised ML in cyber security. As a case study, we consider the problem of botnet detection from network traffic data. We leverage a public dataset (CTU-13) which includes network traffic collected from a university campus and attacks launched on the university network. Among our findings, we highlight the following:

  • Feature representations should take into consideration the specifics of the attacks. Among standard feature representations, we compare connection-level features (extracted directly from Bro logs) with aggregated traffic statistics and temporal features (using fixed time windows).

  • Class imbalance is a major issue that hinders the performance of simple linear models such as logistic regression.

  • Ensemble methods such as gradient boosting have built-in techniques that can handle class imbalance well. They achieve better performance at classifying malicious and benign connections compared to linear models.

  • The granularity of data labeling (ground truth) can impact the classification metrics substantially. If available, ground truth obtained at the level of individual network connections can boost the performance of supervised ML models.

2 Background and Threat Model

2.1 Machine Learning for Network Traffic Classification

Network Intrusion Detection is a highly active area of research. Traditional systems such as Snort are based on manually-generated rules for detecting well-known malware variants.

Recently, ML has proven to be valuable in augmenting rule-based systems. ML has the potential of detecting more advanced malicious activities that evade rule-based systems. Successful applications of ML to various types of network data for malware detection include:

  • Domain reputation systems using passive DNS data, such as Notos [1] and EXPOSURE [5].

  • Command-and-control detection based on NetFlow data, such as DISCLOSURE [4] and BotFinder [17].

  • Malicious communication detection using web-proxy logs, such as ExecScent [11], BAYWATCH [8], and MADE [12].

Bro is an open-source network monitoring agent that collects a number of network logs. Here we leverage the Bro connection logs, which record the fields included in Figure 1. These include the TCP connection timestamp, duration, source IP and port, destination IP and port, number of packets sent and received, number of bytes sent and received, and connection state. For UDP, an entry is generated for every UDP packet (as there do not exist connections over UDP).

Figure 1: Fields in Bro connection log.

2.2 Problem statement and threat model

ML algorithms have demonstrated success in network traffic classification tasks for detecting botnets or malicious domains. However, most ML methods are designed in an ad-hoc manner, and guidelines for principled approaches in this space are currently missing. We are interested in filling this gap and providing recommendations on several general principles that should guide ML design for botnet and malware detection. We specifically address the problem of detecting botnets from network logs (as generated by Bro), but our methods can be used with other network data types (such as NetFlow, pcap, and firewall logs). Some of the research questions we would like to answer are the following:

  • Can raw network data be used effectively in an ML algorithm?

  • Which feature representations are most appropriate for applying ML classification algorithms?

  • Which classifiers achieve best performance in handling the largely imbalanced cyber-security datasets?

  • What is the impact of labeling the data for ground truth generation?

We assume that the monitoring agent, which collects the network data, is not under the attacker's control. We also assume that the attacker cannot tamper with the collected network logs. Therefore, attackers do not have access to the storage device where data is recorded. (Attackers with access to the monitoring environment and the system logs are much more powerful, and are beyond our current scope.)

3 Case Study on ML for Botnet Detection

Figure 2: Overview of the system architecture.

3.1 Dataset

We leverage a dataset of botnet traffic that was captured in 2011 at CTU University in the Czech Republic. The dataset includes 13 scenarios, each including legitimate traffic as well as various attacks such as spam, port scanning, DDoS, and click fraud. The dataset also includes a list of botnet IPs that can be used for labeling the traffic.

Since ML classification needs to use similar attack data for training and testing, we decided to use a subset of 6 scenarios. Among these, 3 scenarios are generated by botnet Neris (performing spam and click fraud activity), and 3 scenarios are generated by botnet Rbot (performing DDoS activity). The statistics are in Table 1. For other botnets, there was only one scenario available and that precluded the use of supervised ML.

In traditional ML, cross-validation is a well-known method to evaluate the generalization of a model. k-fold cross-validation splits the data into k partitions at random, trains a model on k-1 of them and evaluates it on the k-th partition. Splitting the logs at random produces highly-correlated data between training and testing sets. Instead, we train on two scenarios, and test on the third (independent) scenario, repeating the experiment 3 times for each of the two botnets. We have thus assurances that testing data is independent from training. This method of splitting the data into training and testing (based on independent attack scenario) is more appropriate for this setting. In other contexts, the specifics of the environment need to be taken into consideration.
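The scenario-based split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `leave_one_scenario_out` is hypothetical, and only the scenario identifiers are taken from Table 1.

```python
# Scenario IDs per botnet, as listed in Table 1.
SCENARIOS = {"Neris": [1, 2, 9], "Rbot": [4, 10, 11]}


def leave_one_scenario_out(scenario_ids):
    """Yield (train_ids, test_id) pairs: train on all scenarios but one."""
    for test_id in scenario_ids:
        train_ids = [s for s in scenario_ids if s != test_id]
        yield train_ids, test_id


for botnet, ids in SCENARIOS.items():
    for train_ids, test_id in leave_one_scenario_out(ids):
        print(f"{botnet}: train on {train_ids}, test on {test_id}")
```

Unlike a random k-fold split, no log record from the test scenario can end up in the training set, so correlated records from the same capture never straddle the split.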


Botnet  Scenario  Attack                        Botnet (raw)  Botnet (aggregated)  Background (raw)  Background (aggregated)
Neris   1         Spam, click fraud             31,089        569                  3,067,241         76,614
Neris   2         Spam, click fraud             39,730        407                  1,872,270         54,675
Neris   9         Spam, click fraud, port scan  111,895       2,893                1,689,040         62,970
Rbot    4         ICMP, UDP                     126,438       122                  869,648           49,041
Rbot    10        ICMP, UDP                     10,102,210    741                  988,870           55,160
Rbot    11        UDP                           251,814       20                   75,069            3,169
Table 1: CTU-13 botnet scenarios.

3.2 Overview

We show our system architecture in Figure 2. Our system processes network logs collected at the border of an organization (i.e., campus or enterprise network). After data collection, a feature extraction layer is employed to prepare the data for ML training. A number of classification algorithms are used to train a classifier and optimize for standard metrics, such as precision, recall, F1 score, and AUC. The classifiers are applied to new testing scenarios in order to evaluate their generality and predict suspicious network activity. We believe that this framework is general enough to be applicable in other environments.

3.3 Feature extraction

We experiment with different feature representations, as described below.

Connection-level representation. This representation extracts features directly from the raw connection logs. We consider all connections in which ip is either id.orig_h or id.dest_h, and we directly use the fields from the Bro connection logs as features:

ts, id.orig_h, id.orig_p, id.dest_h, id.dest_p, proto, duration, orig_bytes, resp_bytes, orig_pkts, resp_pkts

For categorical features (e.g., proto) we use standard one-hot encoding. In this representation, we obtained 26 features after one-hot encoding.
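As a brief illustration of one-hot encoding a categorical connection field, consider the sketch below. The record values are made up, and the three-protocol category list is an assumption based on the transport protocols named in Table 2; the exact encoding that yields the 26 features is not specified here.

```python
def one_hot(value, categories):
    """Return a 0/1 indicator vector with a 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


# Assumed category list for the proto field (TCP, UDP, ICMP as in Table 2).
PROTOCOLS = ["tcp", "udp", "icmp"]

# Made-up connection record using a few of the fields listed above.
record = {"proto": "udp", "duration": 0.2, "orig_bytes": 120, "resp_bytes": 0}

# Numeric fields are used as they are; the categorical field is expanded.
features = [record["duration"], record["orig_bytes"], record["resp_bytes"]]
features += one_hot(record["proto"], PROTOCOLS)
print(features)
```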

Category             Field       Operator  Definition
IPs (per port)       id.dest_h   Distinct  Number of IPs communicated with per port
IPs (per port)       id.dest_h   Distinct  Number of subnets communicated with per port
Duration (per port)  duration    Sum       Total duration of connection per port
Duration (per port)  duration    Min       Min duration of connection per port
Duration (per port)  duration    Max       Max duration of connection per port
Bytes (per port)     orig_bytes  Sum       Total bytes sent by ip per port
Bytes (per port)     orig_bytes  Min       Min bytes sent by ip in a connection per port
Bytes (per port)     orig_bytes  Max       Max bytes sent by ip in a connection per port
Bytes (per port)     resp_bytes  Sum       Total bytes received by ip per port
Bytes (per port)     resp_bytes  Min       Min bytes received by ip in a connection per port
Bytes (per port)     resp_bytes  Max       Max bytes received by ip in a connection per port
Packets (per port)   orig_pkts   Sum       Total packets sent by ip per port
Packets (per port)   orig_pkts   Min       Min packets sent by ip in a connection per port
Packets (per port)   orig_pkts   Max       Max packets sent by ip in a connection per port
Packets (per port)   resp_pkts   Sum       Total packets received by ip per port
Packets (per port)   resp_pkts   Min       Min packets received by ip in a connection per port
Packets (per port)   resp_pkts   Max       Max packets received by ip in a connection per port
Traffic statistics   proto       Sum       Number of connections per transport protocol (TCP, UDP, ICMP)
Traffic statistics   id.orig_p   Distinct  Number of source ports
Traffic statistics   id.dest_h   Distinct  Number of external destination IPs
Traffic statistics   id.dest_p   Distinct  Number of destination ports
Table 2: Traffic features aggregated by time. The top 4 categories of features are defined per port.

Aggregated traffic statistics. Next, we would like to explore if features obtained by time aggregation are more powerful than raw features. We consider a time interval of length T over which we define aggregated features over all connections in which ip is either id.orig_h or id.dest_h.

An important consideration when defining our features is to generate a fixed number of features, independent of the traffic at a particular host. In our first attempt, we consider the set of all destination IP addresses that ip communicates with: $S_{IP} = \{IP_1, \ldots, IP_n\}$. From these we can define the set of /24 destination subnets that ip communicates with: $S_{subnet} = \{Sub_1, \ldots, Sub_m\}$, with $m \le n$. If we define aggregated features per destination IP or subnet, we will encounter an issue when a host visits new IPs or new subnets. In that case, we need to add new features to our representation, which is not desirable in practice.
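To make the relation between the two sets concrete, the sketch below derives the /24 subnets from a set of destination addresses with Python's standard `ipaddress` module. The addresses are made-up documentation-style examples and the helper name `subnet_24` is hypothetical.

```python
import ipaddress


def subnet_24(ip):
    """Map an IPv4 address to its /24 subnet, e.g. 10.1.2.3 -> 10.1.2.0/24."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


# Made-up destination addresses contacted by one host.
dest_ips = {"10.1.2.3", "10.1.2.77", "192.0.2.10"}
dest_subnets = {subnet_24(ip) for ip in dest_ips}

# Several addresses can share a subnet, so the subnet set is never larger.
print(len(dest_ips), len(dest_subnets))
```

Both sets grow whenever the host contacts a previously unseen destination, which is why a per-destination or per-subnet feature vector cannot have a fixed length.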

To alleviate this problem, we define our aggregated features by destination port (corresponding to applications or network services). Specifically, we define a set of 17 popular application ports (e.g., HTTP - 80, HTTPS - 443, SSH - 22, DNS - 53). We then take a modular approach. We select a small number of operators (Distinct, Sum, Min, Max) and apply them to fields in conn.log for each destination port. The features are described in Table 2. We generate these features separately for outgoing and incoming connections. Additionally, we add some features that capture communication patterns with external IP destinations (e.g., number of connections per transport protocol, number of source and destination ports, number of destination IPs, etc.). In this representation, we obtain 756 aggregated traffic features.
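A minimal sketch of this modular aggregation is shown below. It is illustrative only: the port set is a small subset of the 17 ports (the full list is not given here), the catch-all "other" bucket is an assumption suggested by the "Other" port entry in Table 8, the field names are simplified, and only a few operator and field combinations from Table 2 are computed.

```python
from collections import defaultdict

# Illustrative subset of the popular application ports (assumption).
POPULAR_PORTS = {80, 443, 22, 53}


def port_bucket(port):
    """Keep popular ports as they are; map all others to a catch-all bucket."""
    return port if port in POPULAR_PORTS else "other"


def aggregate(connections, window_seconds):
    """Group connections by (window index, port bucket) and apply the
    Distinct, Sum, Min, and Max operators to a few fields."""
    groups = defaultdict(list)
    for conn in connections:
        window = int(conn["ts"] // window_seconds)
        groups[(window, port_bucket(conn["dest_p"]))].append(conn)

    features = {}
    for key, conns in groups.items():
        durations = [c["duration"] for c in conns]
        sent = [c["orig_bytes"] for c in conns]
        features[key] = {
            "distinct_dest_ips": len({c["dest_h"] for c in conns}),
            "sum_duration": sum(durations),
            "min_duration": min(durations),
            "max_duration": max(durations),
            "sum_orig_bytes": sum(sent),
            "min_orig_bytes": min(sent),
            "max_orig_bytes": max(sent),
        }
    return features


# Made-up outgoing connections of one internal host.
connections = [
    {"ts": 3.0, "dest_h": "192.0.2.10", "dest_p": 80, "duration": 1.5, "orig_bytes": 300},
    {"ts": 12.0, "dest_h": "192.0.2.11", "dest_p": 80, "duration": 0.5, "orig_bytes": 100},
    {"ts": 40.0, "dest_h": "192.0.2.10", "dest_p": 6667, "duration": 2.0, "orig_bytes": 50},
]
features = aggregate(connections, window_seconds=30)
```

Because the set of port buckets is fixed in advance, the feature vector has the same length for every host and window; a full implementation would presumably fill buckets with no traffic with zeros.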

Temporal features. Considering the same time interval T as with the aggregated connection-level features, we define inter-arrival features on a node as the mean, standard deviation, median, minimum, and maximum of the time distribution between node communications. Each internal node has two such sets of features: one for events where the node serves as the source of communication (outgoing), and one where it is the target (incoming). These communications are aggregated by common ports. Thus, in each time interval T, a node i will have the inter-arrival features listed in Table 3. In this representation, we obtain 180 features.
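The five inter-arrival statistics can be computed for one node, direction, and port as in the sketch below. The use of the population standard deviation and the all-zero default for windows with fewer than two events are assumptions, since neither choice is specified above. If one assumes the 17 ports plus one catch-all bucket, 5 statistics x 2 directions x 18 buckets gives 180, which matches the stated count; this decomposition is an inference rather than something stated in the text.

```python
import statistics


def inter_arrival_features(timestamps):
    """Mean, std. dev., median, min, and max of the gaps between
    consecutive events within one time window."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        # Fewer than two events: no gap to summarize (assumed default).
        return {"mean": 0.0, "std": 0.0, "median": 0.0, "min": 0.0, "max": 0.0}
    return {
        "mean": statistics.fmean(gaps),
        "std": statistics.pstdev(gaps),  # population std. dev. (assumption)
        "median": statistics.median(gaps),
        "min": min(gaps),
        "max": max(gaps),
    }


# Made-up timestamps (seconds) of outgoing connections on one port.
feats = inter_arrival_features([0.0, 1.0, 3.0, 7.0])
```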

Category  Statistics                         Definition
Outgoing  Mean, std. dev., median, min, max  Statistics of inter-arrival distribution for outgoing traffic
Incoming  Mean, std. dev., median, min, max  Statistics of inter-arrival distribution for incoming traffic
Table 3: Temporal features aggregated by time. Each of these features is defined per port.

3.4 ML classification and labeling

Ground truth labeling. The CTU-13 dataset provides a list of botnet IP addresses. One of our main observations is that the attack is not active for the entire duration of the data collection. We found that the granularity at which we label the data plays a large role in the results. We experiment with two levels of granularity:

  • Coarse-grained labeling: We label all the connection logs generated by the botnet IPs as Malicious during the entire scenario period.

  • Fine-grained labeling: For the Rbot attack (an instance of DDoS), we obtain the IP address of the victim machine. We use that to identify the attack flows that connect to the victim IP. For all feature representations, we label a time window as Malicious if there is at least one attack log event in that time window.

Fine-grained labeling is difficult to obtain in general because it is a manual process, but when it is available it improves significantly the performance of ML in botnet detection.
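The two labeling granularities can be contrasted with a short sketch. The function name, the event field names, and the addresses are hypothetical; the rule that a window is Malicious if it contains at least one attack event follows the description above.

```python
def label_window(events, botnet_ips, victim_ip=None):
    """Label one time window as 'Malicious' or 'Legitimate'.

    Coarse-grained (victim_ip is None): any event that originates from a
    botnet IP counts as an attack event.
    Fine-grained: only events from a botnet IP to the victim IP count.
    """
    for event in events:
        from_bot = event["orig_h"] in botnet_ips
        if from_bot and (victim_ip is None or event["dest_h"] == victim_ip):
            return "Malicious"
    return "Legitimate"


# Made-up window: the infected host only contacts a benign update server.
botnet_ips = {"10.0.0.5"}
window = [{"orig_h": "10.0.0.5", "dest_h": "198.51.100.7"}]

coarse = label_window(window, botnet_ips)
fine = label_window(window, botnet_ips, victim_ip="203.0.113.9")
print(coarse, fine)
```

In this example the coarse rule marks a window of purely benign traffic as Malicious, which illustrates the kind of label noise that fine-grained ground truth avoids.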

ML models. We consider several well-known ML classification models, including logistic regression, random forest, and gradient boosting. We use several metrics to evaluate the performance of the ML algorithms (precision, recall, F1 score, and AUC). As the imbalance is quite large in this dataset (the ratio of Malicious to Legitimate samples is as low as 1:134 for Neris and 1:401 for Rbot with features aggregated at 30-second intervals), the accuracy is always quite high (above 0.96 in all our experiments). We are interested in results on the minority (Malicious) class, thus precision, recall, F1 score, and AUC are better indicators of how the classifiers perform at detecting botnets.
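A small worked example shows why accuracy is uninformative at this level of imbalance. The confusion counts below are illustrative: they encode the 1:134 ratio quoted above and a trivial classifier that labels every sample Legitimate. Returning 0 for undefined ratios is a convention chosen for this sketch.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts.
    Undefined ratios (zero denominator) are reported as 0.0 by convention."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# One Malicious sample per 134 Legitimate ones; everything predicted Legitimate.
trivial = metrics(tp=0, fp=0, fn=1, tn=134)
print(trivial)
```

The trivial classifier reaches an accuracy of 134/135 (above 0.99) while detecting no attack at all, so its recall and F1 score are both zero.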

For the ML classifiers, we perform a grid search on several hyper-parameters to select the models performing best in our setting. For Random Forest, we selected the number of trees in {10,50,100,200} and found that 100 trees worked best. For Gradient Boosting, we varied the number of estimators in {50,100,200}, the maximum depth of each tree in {3,5,7}, and the learning rate in {0.01,0.05,0.1}. We selected 100 estimators with maximum depth of 3 and learning rate 0.05. For logistic regression, we used L1 (Lasso) regularization to reduce the dimensionality of the feature space.
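The search space described above can be enumerated as in the following sketch. The parameter names are chosen for illustration, and `score` stands for a caller-supplied evaluation function (for instance, F1 on a held-out scenario); neither is prescribed by the text, and no actual training is performed here.

```python
from itertools import product

# Hyper-parameter grids with the candidate values listed above.
RF_GRID = {"n_estimators": [10, 50, 100, 200]}
GB_GRID = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
}


def expand(grid):
    """Return every combination of hyper-parameter values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def grid_search(grid, score):
    """Return the configuration that maximizes the caller-supplied score."""
    return max(expand(grid), key=score)


print(len(expand(RF_GRID)), "random forest configurations")
print(len(expand(GB_GRID)), "gradient boosting configurations")
```

The gradient boosting grid contains 3 x 3 x 3 = 27 configurations, each of which has to be trained and evaluated once per train/test split.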

4 Experimental Evaluation

During our experimental evaluation, we would like to answer several research questions, which we detail below.

Which feature representation performs best? We compare different feature representations (connection-level representation, aggregated traffic statistics, and temporal features). For this experiment, we use a random forest classifier with 100 trees and a 30-second time window for aggregation.

The results for Neris are in Table 4 and they show that aggregated features (both traffic statistics and temporal) perform significantly better than raw features extracted directly from Bro logs at all metrics of interest. For instance, when training on scenarios 2 and 9 and testing on scenario 1, the F1 score for connection features is 0.65, while the F1 score for aggregated features is 0.98. We do not observe a major difference when we consider both traffic and timing features, compared to using only aggregated traffic features.

The results for Rbot for fine-grained labeling are in Table 5. Here, connection-based features perform quite well. The reason is that this is a DDoS attack in which all packets sent to the victim are identical. However, traffic statistics and temporal features also perform well. The exception is when training on scenarios 4 and 11, and testing on scenario 10. In that case, the number of botnet samples used for training with 30-second aggregation is very small (142), while there are many more botnet samples in the raw data (378,252).


Features              Training scenarios  Testing scenario  Prec.  Recall  F1    AUC
Connection            2, 9                1                 0.68   0.62    0.65  0.87
Connection            1, 9                2                 0.89   0.43    0.58  0.88
Connection            1, 2                9                 0.92   0.70    0.80  0.94
Traffic               2, 9                1                 0.99   0.98    0.98  0.99
Traffic               1, 9                2                 0.94   0.96    0.95  0.99
Traffic               1, 2                9                 1      0.90    0.94  0.96
Traffic and Temporal  2, 9                1                 0.99   0.97    0.98  0.99
Traffic and Temporal  1, 9                2                 0.95   0.96    0.95  0.98
Traffic and Temporal  1, 2                9                 1      0.90    0.94  0.96
Table 4: Classification metrics for the Neris botnet for Random Forest with different feature representations.

Features              Training scenarios  Testing scenario  Prec.  Recall  F1    AUC
Connection            10, 11              4                 0.99   0.99    0.99  0.99
Connection            4, 11               10                0.99   0.99    0.99  0.99
Connection            4, 10               11                0.99   0.99    0.99  0.99
Traffic               10, 11              4                 1      1       1     1
Traffic               4, 11               10                1      0.85    0.92  0.92
Traffic               4, 10               11                1      1       1     1
Traffic and Temporal  10, 11              4                 1      1       1     1
Traffic and Temporal  4, 11               10                1      0.85    0.92  0.92
Traffic and Temporal  4, 10               11                1      1       1     1
Table 5: Classification metrics for the Rbot botnet for Random Forest with different feature representations and fine-grained labeling.
Figure 3: F1 scores for different time windows for Neris.

Time (seconds)  Training scenarios  Testing scenario  Prec.  Recall  F1    AUC
1               2, 9                1                 0.89   0.87    0.88  0.98
1               1, 9                2                 0.92   0.87    0.90  0.98
1               1, 2                9                 0.98   0.89    0.93  0.98
10              2, 9                1                 0.84   0.98    0.91  0.99
10              1, 9                2                 0.96   0.96    0.96  0.99
10              1, 2                9                 1      0.92    0.96  0.98
30              2, 9                1                 0.99   0.97    0.98  0.99
30              1, 9                2                 0.95   0.96    0.95  0.98
30              1, 2                9                 1      0.90    0.94  0.96
60              2, 9                1                 0.99   0.97    0.98  0.99
60              1, 9                2                 0.95   0.96    0.95  0.98
60              1, 2                9                 1      0.87    0.93  0.95
120             2, 9                1                 0.97   0.97    0.97  0.99
120             1, 9                2                 0.91   0.94    0.92  0.98
120             1, 2                9                 0.99   0.82    0.90  0.92
240             2, 9                1                 0.94   0.97    0.95  0.99
240             1, 9                2                 0.85   0.92    0.89  0.97
240             1, 2                9                 1      0.75    0.85  0.89
600             2, 9                1                 0.87   1       0.93  0.99
600             1, 9                2                 0.76   0.90    0.83  0.99
600             1, 2                9                 1      0.59    0.74  0.82
Table 6: Classification metrics for the Neris botnet for Random Forest with different time windows. We used the aggregated traffic statistics and temporal features.

What is the impact of varying the time window? Here, we validate the choice of the time window for aggregation. Table 6 and Figure 3 show results for varying the time window from 1 to 600 seconds. The 30-second and 60-second time windows exhibit similar results and perform well most of the time. The 10-second window also performs well, except when testing on scenario 1. As the time window increases beyond 120 seconds, the results start to degrade. We suspect this is because of the small number of attack samples at larger aggregation windows, as well as additional noise in the legitimate traffic. In general, selecting the best time window for aggregation is attack-dependent. We recommend the use of cross-validation for selecting the optimal value of the time window. Based on these results, we select a time window of 30 seconds for the rest of the experiments.
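One simple way to summarize Table 6 is to average the F1 score over the three test scenarios for each window length, as sketched below. The F1 values are copied from Table 6; using the mean F1 as the selection criterion is an assumption made for this illustration, not the procedure stated above.

```python
from statistics import fmean

# F1 scores from Table 6, in the order of test scenarios 1, 2, 9,
# keyed by aggregation window length in seconds.
F1_BY_WINDOW = {
    1: [0.88, 0.90, 0.93],
    10: [0.91, 0.96, 0.96],
    30: [0.98, 0.95, 0.94],
    60: [0.98, 0.95, 0.93],
    120: [0.97, 0.92, 0.90],
    240: [0.95, 0.89, 0.85],
    600: [0.93, 0.83, 0.74],
}

mean_f1 = {window: fmean(scores) for window, scores in F1_BY_WINDOW.items()}
best_window = max(mean_f1, key=mean_f1.get)

for window, value in mean_f1.items():
    print(f"{window:>4} s: mean F1 = {value:.3f}")
print("best window:", best_window)
```

Under this criterion the 30-second window comes out narrowly ahead of the 60-second window, and the mean F1 drops steadily for windows longer than 120 seconds.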

What is the impact of different ML models? One important observation is that the amount of imbalance in cyber security is very large (as also observed by previous work [3, 12]). It is well-known that ensemble classifiers such as random forests and boosting handle imbalance much better than simpler models. We test this hypothesis by using three different classifiers for our task: logistic regression, random forests, and gradient boosting. We fix the aggregation time window to 30 seconds and use the traffic statistics and temporal features.

The results for the three classifiers for Neris are in Table 7 and the precision-recall curves are in Figure 4. All three models we experimented with perform relatively well. Both ensemble methods outperform the logistic regression model when testing on scenarios 1 and 9, with F1 scores between 0.93 and 0.98 across all scenarios, while logistic regression is marginally ahead when testing on scenario 2. The difference between random forest and gradient boosting is negligible; both are powerful classification models.

Figure 4: Precision-recall curves for three classifiers for Neris.

Model                Training scenarios  Testing scenario  Prec.  Recall  F1    AUC
Logistic Regression  2, 9                1                 0.90   0.90    0.90  0.94
Logistic Regression  1, 9                2                 0.98   0.95    0.97  0.99
Logistic Regression  1, 2                9                 0.97   0.87    0.92  0.96
Random Forest        2, 9                1                 0.99   0.97    0.98  0.99
Random Forest        1, 9                2                 0.95   0.96    0.95  0.98
Random Forest        1, 2                9                 1      0.90    0.94  0.96
Gradient Boosting    2, 9                1                 1      0.97    0.98  0.99
Gradient Boosting    1, 9                2                 1      0.92    0.96  0.99
Gradient Boosting    1, 2                9                 1      0.87    0.93  0.95
Table 7: Classification metrics for the Neris botnet for three classifiers for aggregated traffic statistics and temporal features (aggregation window 30 seconds).

Are the models interpretable? To understand what the ML models learned, we computed feature importance for the random forest classifier for both Neris and Rbot (using the aggregated traffic statistics and timing features at 30-second window). The results are in Table 8. Interestingly, we observe that the classifier identifies features that are correlated with the attack. Neris is a spam botnet and most of its activity uses port 25, making features such as distinct source ports and median inter-arrival packet time on port 25 most relevant. In contrast, Rbot is a DDoS botnet that uses different ports for the attack. For instance, the UDP flood is using port 161, and the classifier correctly determines that the standard deviation of inter-arrival packet timing on port 161 is the most important feature.

These results show our framework’s flexibility and ability to generalize to different attack patterns. We defined a set of 936 generic features that can be used for a variety of botnet attacks. For the two different botnets we experimented with, the ML models identified the most relevant features that are correlated with the attacks, without the need for a human expert to explicitly locate those features. Models such as random forest provide standard metrics for feature importance, with a clear advantage for model interpretability compared to deep learning and neural networks that lack interpretability. Interpretability is important in cyber security, as most of the time human experts analyze the alerts of ML systems.

Botnet  Feature                          Port   Importance
Neris   Distinct source ports            25     0.085
Neris   Median inter-arrival time        25     0.070
Neris   Distinct destination ports       25     0.067
Neris   Min packets sent                 25     0.061
Neris   Distinct external IPs            25     0.054
Neris   Total duration                   25     0.053
Neris   Total packets sent               25     0.051
Neris   Max duration of connection       25     0.048
Rbot    Std. dev. of inter-arrival time  161    0.049
Rbot    Distinct source ports            135    0.046
Rbot    Distinct source ports            Other  0.043
Rbot    Min inter-arrival timing         138    0.042
Rbot    Distinct source ports            138    0.040
Rbot    Distinct source ports            3      0.038
Rbot    Distinct source ports            8      0.030
Rbot    Std. dev. of inter-arrival time  138    0.029
Table 8: Feature importance for the Neris botnet (top) and the Rbot botnet (bottom). All these features are for outgoing connections from an internal node.

What is the impact of labeling flows accurately? We perform an experiment to test how the granularity of data labeling impacts the classification results. For the Rbot DDoS botnet we have access to the IP address of the victim machine and thus we can determine which connections are botnet-related. We use fine-grained labeling to refer to the process of labeling only the botnet connections to the victim IP as Malicious, and coarse-grained labeling to refer to the process of labeling all connections initiated by the botnet IP as Malicious.

Table 9 shows the results of fine-grained and coarse-grained labeling for the Random Forest and Gradient Boosting classifiers for features aggregated at 30-second intervals. The results demonstrate that classifier performance obtained with fine-grained labeling is much better than with coarse-grained labeling. For instance, when training on scenarios 10 and 11 and testing on scenario 4, the Random Forest F1 score for coarse-grained labeling is 0.44, compared to a perfect F1 score for fine-grained labeling. Both classifiers perform similarly for fine-grained labeling.


Model              Label  Training scenarios  Testing scenario  Prec.  Recall  F1    AUC
Random Forest      C      10, 11              4                 0.75   0.31    0.44  0.94
Random Forest      C      4, 11               10                0.99   0.76    0.86  0.90
Random Forest      C      4, 10               11                0.93   0.75    0.83  0.91
Random Forest      F      10, 11              4                 1      1       1     1
Random Forest      F      4, 11               10                1      0.85    0.92  0.92
Random Forest      F      4, 10               11                1      1       1     1
Gradient Boosting  C      10, 11              4                 1      0.27    0.42  0.84
Gradient Boosting  C      4, 11               10                0.99   0.74    0.84  0.95
Gradient Boosting  C      4, 10               11                1      0.75    0.85  0.92
Gradient Boosting  F      10, 11              4                 1      1       1     1
Gradient Boosting  F      4, 11               10                1      0.85    0.92  0.92
Gradient Boosting  F      4, 10               11                1      1       1     1
Table 9: Classification metrics for the Rbot botnet for two methods of labeling the data (coarse-grained or C and fine-grained or F). We used the Random Forest (100 trees) and Gradient Boosting classifiers for aggregated traffic statistics and temporal features (aggregation window 30 seconds).

5 Lessons and General Recommendations

Motivated by our case study of botnet classification from Bro logs, we highlight several guidelines that we believe are applicable in other settings where ML is used in cyber security.

Multiple feature representations need to be evaluated. Features extracted directly from raw data such as Bro connection logs do not always result in the best representation. A representation that worked well in our setting for classifying internal IP addresses is feature aggregation by time windows and port number. We also observed that the choice of feature representation depends on the amount of training data available. With the large imbalance between the malicious and benign classes, smaller time windows work better for aggregation. However, the right feature representation and the choice of time window for feature aggregation are dependent on the attack type. We recommend evaluating multiple feature representations.

Model interpretability. Models that provide interpretability are preferred in cyber security as security analysts need to investigate the alerts raised by ML systems. Understanding why a flow is labeled as malicious can speed up the investigation significantly. We showed how a random forest classifier is interpretable by identifying the most relevant features, which provide clear insights about the botnet activity.

Data imbalance raises a challenge for supervised learning. Data imbalance poses a major challenge when applying classification methods to cyber security. Simpler models such as linear models are not equipped to deal well with class imbalance. We showed that ensemble models such as random forest and gradient boosting achieve good results even in highly imbalanced scenarios, compared to logistic regression. For instance, at an imbalance of 1:134 (when testing on scenario 2 for Neris) we obtain 0.97 precision and 0.95 recall with gradient boosting.

The alternative to classification is to employ anomaly-detection models that learn from the legitimate class and identify attacks as anomalies. Nevertheless, Sommer and Paxson [14] discussed extensively the difficulty of using anomaly detection in cyber security. We plan to investigate the performance of anomaly detectors in future work.

Fine-grained ground truth labeling can be a major factor in the success of supervised learning. As we demonstrated, data labeling for generating the ground truth is a major factor in the success of supervised learning algorithms. If detailed information about the attack is available (e.g., the destination IPs contacted by the attacker), then the performance of classifiers can be greatly improved. However, it is difficult most of the time to identify exactly the attack flows, even when running controlled attack simulations. Malware can contact a variety of IP addresses using different protocols, but infected machines also generate a fair number of legitimate connections (e.g., connections to Windows updates).

Acknowledgements

The research reported in this document/presentation was performed in connection with contract number W911NF-18-C-0019 with the U.S. Army Contracting Command - Aberdeen Proving Ground (ACC-APG) and the Defense Advanced Research Projects Agency (DARPA). The views and conclusions contained in this document/presentation are those of the authors and should not be interpreted as presenting the official policies or position, either expressed or implied, of ACC-APG, DARPA, or the U.S. Government unless so designated by other authorized documents. Citation of manufacturer’s or trade names does not constitute an official endorsement or approval of the use thereof. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

We thank Malathi Veeraraghavan, Jack Davidson, Alastair Nottingham, and Donald Brown from University of Virginia, Kolia Sadeghi from Commonwealth Computer Research, Inc., and other PCORE-project team members for their support of this work. We would also like to thank Vijay Sarvepalli, Andrew J Kompanek, and Lena Pons from the Software Engineering Institute at Carnegie Mellon University for their helpful feedback regarding the evaluation.

References

  • [1] M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster. Building a dynamic reputation system for DNS. In Proc. 19th USENIX Security Symposium, 2010.
  • [2] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, S. Abu-Nimeh, W. Lee, and D. Dagon. From throw-away traffic to bots: Detecting the rise of DGA-based malware. In Proc. 21st USENIX Security Symposium, 2012.
  • [3] K. Bartos, M. Sofka, and V. Franc. Optimized invariant representation of network traffic for detecting unseen malware variants. In 25th USENIX Security Symposium (USENIX Security 16), pages 807–822. USENIX Association, 2016.
  • [4] L. Bilge, D. Balzarotti, W. Robertson, E. Kirda, and C. Kruegel. DISCLOSURE: Detecting botnet Command-and-Control servers through large-scale NetFlow analysis. In Proc. 28th Annual Computer Security Applications Conference (ACSAC), 2012.
  • [5] L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi. EXPOSURE: Finding malicious domains using passive DNS analysis. In Proc. 18th Symposium on Network and Distributed System Security (NDSS), 2011.
  • [6] Endgame. Using Deep Learning To Detect DGAs. https://www.endgame.com/blog/technical-blog/using-deep-learning-detect-dgas, 2016.
  • [7] FireEye. Reverse Engineering the Analyst: Building Machine Learning Models for the SOC. https://www.fireeye.com/blog/threat-research/2018/06/build-machine-learning-models-for-the-soc.html, 2018.
  • [8] X. Hu, J. Jang, M. P. Stoecklin, T. Wang, D. L. Schales, D. Kirat, and J. R. Rao. BAYWATCH: robust beaconing detection to identify infected hosts in large-scale enterprise networks. In DSN, pages 479–490. IEEE Computer Society, 2016.
  • [9] L. Invernizzi, S. Miskovic, R. Torres, S. Saha, S.-J. Lee, C. Kruegel, and G. Vigna. Nazca: Detecting malware distribution in large-scale networks. In Proc. ISOC Network and Distributed System Security Symposium (NDSS ’14), 2014.
  • [10] Microsoft. Machine Learning in Azure Security Center. https://azure.microsoft.com/en-us/blog/machine-learning-in-azure-security-center/, 2016.
  • [11] T. Nelms, R. Perdisci, and M. Ahamad. ExecScent: Mining for new C&C domains in live networks with adaptive control protocol templates. In Proc. 22nd USENIX Security Symposium, 2013.
  • [12] A. Oprea, Z. Li, R. Norris, and K. Bowers. MADE: Security analytics for enterprise threat detection. In Proc. Annual Computer Security Applications Conference (ACSAC), 2018.
  • [13] RSA. Threat Detection and Response NetWitness Platform. https://www.rsa.com/en-us/products/threat-detection-response, 2018.
  • [14] R. Sommer and V. Paxson. Outside the closed world: On using machine learning for network intrusion detection. In Proc. IEEE Symposium on Security and Privacy, SP ’10. IEEE Computer Society, 2010.
  • [15] G. Stringhini, C. Kruegel, and G. Vigna. Shady Paths: Leveraging surfing crowds to detect malicious web pages. In Proc. 20th ACM Conference on Computer and Communications Security, CCS, 2013.
  • [16] Symantec. How does Symantec Endpoint Protection use advanced machine learning? https://support.symantec.com/en_US/article.HOWTO125816.html, 2018.
  • [17] F. Tegeler, X. Fu, G. Vigna, and C. Kruegel. BotFinder: Finding bots in network traffic without deep packet inspection. In Proc. 8th International Conference on Emerging Networking Experiments and Technologies, CoNEXT, 2012.