Extending Isolation Forest for Anomaly Detection in Big Data via K-Means

  • 2021-04-27 16:21:48
  • Md Tahmid Rahman Laskar, Jimmy Huang, Vladan Smetana, Chris Stewart, Kees Pouw, Aijun An, Stephen Chan, Lei Liu

Abstract

Industrial Information Technology (IT) infrastructures are often vulnerable to cyberattacks. To secure the computer systems in an industrial environment, effective intrusion detection systems are required to monitor the cyber-physical systems (e.g., computer networks) in the industry for malicious activities. This paper aims to build such intrusion detection systems to protect computer networks from cyberattacks. More specifically, we propose a novel unsupervised machine learning approach that combines the K-Means algorithm with the Isolation Forest for anomaly detection in industrial big data scenarios. Since our objective is to build the intrusion detection system for the big data scenario in the industrial domain, we utilize the Apache Spark framework to implement our proposed model, which was trained on large network traffic data (about 123 million instances of network traffic) stored in Elasticsearch. Moreover, we evaluate our proposed model on live streaming data and find that our proposed system can be used for real-time anomaly detection in the industrial setup. In addition, we address different challenges that we faced while training our model on large datasets and explicitly describe how these issues were resolved. Based on our empirical evaluation in different use cases for anomaly detection in real-world network traffic data, we observe that our proposed system is effective at detecting anomalies in big data scenarios. Finally, we evaluate our proposed model on several academic datasets to compare it with other models and find that it provides performance comparable to other state-of-the-art approaches.

 
