BERT2Code: Can Pretrained Language Models be Leveraged for Code Search?

Abstract

Millions of repetitive code snippets are submitted to code repositories every day. Searching these large codebases with simple natural language queries would allow programmers to ideate, prototype, and develop more easily and quickly. Although existing methods perform well at retrieving code when the natural language description contains keywords from the code, they still lag far behind at retrieving code based on the semantic meaning of the natural language query and the semantic structure of the code. In recent years, both the natural language and programming language research communities have created techniques to embed natural language and code in vector spaces. In this work, we leverage the efficacy of these embedding models using a simple, lightweight 2-layer neural network in the task of semantic code search. We show that our model learns the inherent relationship between the embedding spaces, and we further probe the scope for improvement by empirically analyzing the embedding methods. In this analysis, we show that the quality of the code embedding model is the bottleneck for our model's performance, and we discuss future directions of study in this area.
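
As an illustration of the idea described in the abstract, the following is a minimal sketch of how a 2-layer network could map a natural language query embedding into a code embedding space and rank snippets by similarity. It is not the authors' implementation: the function names (`init_params`, `project`, `rank_snippets`), the ReLU activation, the layer sizes, the random initialization, and the use of cosine similarity for ranking are all assumptions made for illustration, and training is omitted entirely.

```python
import numpy as np

def init_params(d_query, d_hidden, d_code, seed=0):
    """Randomly initialize a 2-layer network (hypothetical sizes, untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (d_query, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(0.0, 0.1, (d_hidden, d_code)),
        "b2": np.zeros(d_code),
    }

def project(params, query_embedding):
    """Map a query embedding into the code embedding space."""
    # Hidden layer with ReLU (the activation is an assumption).
    hidden = np.maximum(0.0, query_embedding @ params["W1"] + params["b1"])
    # Linear output layer producing a vector of code-embedding size.
    return hidden @ params["W2"] + params["b2"]

def rank_snippets(projected_query, code_embeddings):
    """Rank code snippets by cosine similarity to the projected query."""
    eps = 1e-12  # avoids division by zero for all-zero vectors
    q = projected_query / (np.linalg.norm(projected_query) + eps)
    norms = np.linalg.norm(code_embeddings, axis=1, keepdims=True)
    c = code_embeddings / (norms + eps)
    scores = c @ q
    # Indices of snippets sorted from most to least similar.
    return np.argsort(-scores), scores
```

In such a setup, the query and code embeddings would come from separate pretrained embedding models, and only the small projection network would be trained, which is consistent with the abstract's observation that the quality of the code embedding model can limit overall performance.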
