A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software

  • 2019-02-07 12:47:13
  • Serena E. Ponta, Henrik Plate, Antonino Sabetta, Michele Bezzi, Cédric Dangremont

Abstract

Advancing our understanding of software vulnerabilities, automating their identification, the analysis of their impact, and ultimately their mitigation is necessary to enable the development of software that is more secure. While operating a vulnerability assessment tool that we developed and that is currently used by hundreds of development units at SAP, we manually collected and curated a dataset of vulnerabilities of open-source software and the commits fixing them. The data was obtained both from the National Vulnerability Database (NVD) and from project-specific Web resources that we monitor on a continuous basis. From that data, we extracted a dataset that maps 624 publicly disclosed vulnerabilities affecting 205 distinct open-source Java projects, used in SAP products or internal tools, onto the 1282 commits that fix them. Out of the 624 vulnerabilities, 29 do not have a CVE identifier at all and 46, which do have a CVE identifier assigned by a numbering authority, are not yet available in the NVD. The dataset is released under an open-source license, together with supporting scripts that allow researchers to automatically retrieve the actual content of the commits from the corresponding repositories and to augment the attributes available for each instance. These scripts also make it possible to complement the dataset with additional instances that are not security fixes (which is useful, for example, in machine learning applications). Our dataset has been successfully used to train classifiers that could automatically identify security-relevant commits in code repositories. The release of this dataset and the supporting code as open source will allow future research to be based on data of industrial relevance; it also represents a concrete step towards making the maintenance of this dataset a shared effort involving open-source communities, academia, and industry.

 
