Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

Abstract

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for creating such a data set: the questions are diverse, and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and in the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features that consider the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly improves both coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
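
To make the pipeline described in the abstract concrete, the following is a minimal sketch under several assumptions; it is not the paper's implementation. The three structural features are hypothetical examples of hand-crafted features, `correspondence_score` is a simple token-overlap stand-in for the neural correspondence model, and the classifier is a small logistic regression trained by stochastic gradient descent on four invented labeled pairs.

```python
import math
import re
from typing import List, Tuple


def structural_features(snippet: str) -> List[float]:
    """Hypothetical hand-crafted features describing a snippet's structure."""
    lines = [ln for ln in snippet.splitlines() if ln.strip()]
    return [
        float(len(lines)),                                        # non-empty line count
        1.0 if snippet.lstrip().startswith("import ") else 0.0,   # starts with an import
        1.0 if "=" in snippet else 0.0,                           # contains an '=' sign
    ]


def correspondence_score(intent: str, snippet: str) -> float:
    """Stand-in for the learned NL-code correspondence feature (assumption):
    the fraction of intent words that also occur in the snippet."""
    intent_tokens = set(intent.lower().split())
    code_tokens = set(re.findall(r"[a-z]+", snippet.lower()))
    if not intent_tokens:
        return 0.0
    return len(intent_tokens & code_tokens) / len(intent_tokens)


def featurize(intent: str, snippet: str) -> List[float]:
    """Concatenate structural and correspondence features."""
    return structural_features(snippet) + [correspondence_score(intent, snippet)]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w: List[float], b: float, x: List[float]) -> float:
    """Estimated probability that the pair is a valid NL-code pair."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X: List[List[float]], y: List[int],
                   lr: float = 0.1, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = predict_proba(w, b, xi) - yi
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b


# Tiny invented training set: (intent, candidate snippet, label).
examples = [
    ("sort a list in reverse order", "sorted(my_list, reverse=True)", 1),
    ("sort a list in reverse order", "import os", 0),
    ("read a file line by line", "for line in open(path):\n    print(line)", 1),
    ("read a file line by line", "x = 1", 0),
]
X = [featurize(intent, code) for intent, code, _ in examples]
y = [label for _, _, label in examples]
w, b = train_logistic(X, y)

good = predict_proba(w, b, featurize("sort a list in reverse order",
                                     "sorted(my_list, reverse=True)"))
bad = predict_proba(w, b, featurize("read a file line by line", "x = 1"))
```

In this toy setup the two candidates scored at the end share identical structural features, so any difference between `good` and `bad` comes from the correspondence feature alone. In the method described above, that role is played by a trained probabilistic model rather than word overlap.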
