Multimodal Representation for Neural Code Search

Abstract

Semantic code search is the task of finding semantically relevant code snippets for a given natural language query. In state-of-the-art approaches, the semantic similarity between code and query is quantified as the distance between their representations in a shared vector space. In this paper, to improve the vector space, we introduce tree-serialization methods on a simplified form of the abstract syntax tree (AST) and build a multimodal representation for the code data. We conduct extensive experiments using a single corpus that is large-scale and multi-language: CodeSearchNet. Our results show that both our tree-serialized representations and our multimodal learning model improve the performance of code search. Finally, we define intuitive quantification metrics oriented to the completeness of the semantic and syntactic information of the code data, to help understand the experimental findings.
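
To make the idea of tree serialization concrete, the following is a minimal sketch and not the paper's own method: it uses Python's standard `ast` module, drops expression-context nodes (`Load`, `Store`) as one assumed form of simplification, and flattens the tree into a bracketed pre-order token sequence. The function names `serialize` and `tokens_for` are hypothetical and chosen only for illustration.

```python
import ast


def serialize(node):
    """Flatten an AST into a bracketed pre-order token list.

    Brackets mark subtree boundaries so the tree shape can be
    recovered from the flat sequence.
    """
    tokens = ["(", type(node).__name__]
    for child in ast.iter_child_nodes(node):
        # Assumed simplification: skip Load/Store/Del context nodes,
        # which add little information for search.
        if isinstance(child, ast.expr_context):
            continue
        tokens.extend(serialize(child))
    tokens.append(")")
    return tokens


def tokens_for(source):
    """Parse Python source text and return its serialized AST tokens."""
    return serialize(ast.parse(source))


if __name__ == "__main__":
    print(" ".join(tokens_for("def add(a, b):\n    return a + b")))
```

A sequence like this could be fed to a sequence encoder alongside the raw code tokens, which is one plausible reading of a multimodal representation of code.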
