Cross-Language Binary-Source Code Matching with Intermediate Representations

  • 2022-01-19 05:17:02
  • Yi Gui, Yao Wan, Hongyu Zhang, Huifang Huang, Yulei Sui, Guandong Xu, Zhiyuan Shao, Hai Jin

Abstract

Binary-source code matching plays an important role in many security and software engineering tasks, such as malware detection, reverse engineering, and vulnerability assessment. Several approaches have been proposed for binary-source code matching that jointly learn the embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches focus on matching binary code against source code written in a single programming language. In practice, however, software applications are often written in different programming languages to cater for different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching and develops a new dataset for this problem. We present a novel approach, XLIR, a Transformer-based neural network that learns intermediate representations for both binary and source code. To validate the effectiveness of XLIR, comprehensive experiments are conducted on two tasks, cross-language binary-source code matching and cross-language source-source code matching, on top of our curated dataset. Experimental results and analysis show that the proposed XLIR with intermediate representations significantly outperforms other state-of-the-art models on both tasks.
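
To make the idea of matching in a common vector space concrete, the following is a minimal sketch, not the XLIR model itself. It assumes that some encoder has already mapped a binary and several candidate source files to fixed-length embeddings; the vectors, the file names, and the helper names `cosine` and `rank_sources` are hypothetical and chosen only for illustration. Matching then reduces to ranking the candidates by cosine similarity to the binary's embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sources(binary_vec, source_vecs):
    """Rank candidate source files by similarity to a binary's embedding, best first."""
    scored = [(name, cosine(binary_vec, vec)) for name, vec in source_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: in a real system these would come from a learned
# encoder applied to the code (or its intermediate representation).
binary_embedding = [0.9, 0.1, 0.3]
source_embeddings = {
    "sort.c": [0.8, 0.2, 0.4],
    "Sort.java": [0.85, 0.15, 0.25],
    "hash.c": [0.1, 0.9, 0.2],
}

ranking = rank_sources(binary_embedding, source_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the candidates here are written in different languages, the ranking only makes sense if all embeddings live in the same space, which is the property the joint learning described above is meant to provide.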

 
