VEXIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity

  • 2024-07-09 18:38:42
  • S. VenkataKeerthy, Soumya Banerjee, Sayan Dey, Yashas Andaluri, Raghul PS, Subrahmanyam Kalyanasundaram, Fernando Magno Quintão Pereira, Ramakrishna Upadrasta


Binary similarity involves determining whether two binary programs exhibit similar functionality, often originating from the same source code. In this work, we propose VexIR2Vec, an approach for binary similarity using VEX-IR, an architecture-neutral Intermediate Representation (IR). We extract the embeddings from sequences of basic blocks, termed peepholes, derived by random walks on the control-flow graph. The peepholes are normalized using transformations inspired by compiler optimizations. With these transformations, the VEX-IR Normalization Engine mitigates the architectural and compiler-induced variations in binaries while exposing semantic similarities. We then learn the vocabulary of representations at the entity level of the IR using knowledge graph embedding techniques in an unsupervised manner. This vocabulary is used to derive function embeddings for similarity assessment using VexNet, a feed-forward Siamese network designed to position similar functions closely and separate dissimilar ones in an n-dimensional space. This approach is amenable to both diffing and searching tasks, ensuring robustness against Out-Of-Vocabulary (OOV) issues. We evaluate VexIR2Vec on a dataset comprising 2.7M functions and 15.5K binaries from 7 projects compiled across 12 compilers targeting x86 and ARM architectures. In diffing experiments, VexIR2Vec outperforms the nearest baselines by $40\%$, $18\%$, $21\%$, and $60\%$ in cross-optimization, cross-compilation, cross-architecture, and obfuscation settings, respectively. In the searching experiment, VexIR2Vec achieves a mean average precision of $0.76$, outperforming the nearest baseline by $46\%$. Our framework is highly scalable and is built as a lightweight, multi-threaded, parallel library using only open-source tools. VexIR2Vec is $3.1$-$3.5 \times$ faster than the closest baselines and orders-of-magnitude faster than other tools.
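
To make the peephole idea concrete, the following is a minimal sketch of deriving basic-block sequences by random walks on a control-flow graph. It is not the VexIR2Vec implementation: the function name `peepholes`, the parameters `walk_len` and `num_walks`, the choice to start every walk at the entry block, and the toy graph are all assumptions made for illustration.

```python
import random

def peepholes(cfg, entry, walk_len=3, num_walks=4, seed=0):
    """Collect basic-block sequences by random walks over a CFG.

    cfg maps a block id to the list of its successor block ids.
    All names and defaults here are hypothetical, not from the paper.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    walks = []
    for _ in range(num_walks):
        block, walk = entry, [entry]
        # Extend the walk until it is long enough or reaches a block
        # with no successors (e.g. a return block).
        while len(walk) < walk_len and cfg.get(block):
            block = rng.choice(cfg[block])
            walk.append(block)
        walks.append(walk)
    return walks

# Toy diamond-shaped CFG (assumed example): B0 branches to B1 or B2,
# and both fall through to B3.
cfg = {"B0": ["B1", "B2"], "B1": ["B3"], "B2": ["B3"], "B3": []}
walks = peepholes(cfg, "B0")
```

Each returned walk is a path through the graph, so every consecutive pair of blocks in a walk is a control-flow edge. In the pipeline described above, such sequences would then be normalized before embeddings are derived from them.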

