Automated Vulnerability Detection in Source Code Using Deep Representation Learning

Abstract

Increasing numbers of software vulnerabilities are discovered every year whether they are reported publicly or discovered internally in proprietary code. These vulnerabilities can pose serious risk of exploit and result in system compromise, information leaks, or denial of service. We leveraged the wealth of C and C++ open-source code available to develop a large-scale function-level vulnerability detection system using machine learning. To supplement existing labeled vulnerability datasets, we compiled a vast dataset of millions of open-source functions and labeled it with carefully-selected findings from three different static analyzers that indicate potential exploits. Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. We evaluated our tool on code from both real software packages and the NIST SATE IV benchmark dataset. Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection.
