CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Abstract

Using large language models (LLMs) to generate code has shown promise for transforming software development. Despite the capability of general-purpose LLMs, their performance on code generation can still be improved, because of the syntactic gap and mismatched vocabulary between natural language and different programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework that enhances the performance of LLMs. CodeGRAG builds a graphical view of code blocks from their control flow and data flow to fill the gap between programming languages and natural language. This view helps natural-language-based LLMs better understand code syntax and serves as a bridge among different programming languages. To incorporate the extracted structural knowledge into foundation models, we propose 1) a hard meta-graph prompt template that transforms the challenging graphical representation into informative knowledge for tuning-free models and 2) a soft prompting technique that injects the domain knowledge of programming languages into the model parameters by finetuning the models with the help of a pretrained GNN expert model. We conduct experiments and ablations on four datasets covering both C++ and Python to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert. CodeGRAG improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation. The implementation is available at https://anonymous.4open.science/r/Code-5970/.
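
To make the idea of a graphical view concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy setting: a single Python function is parsed with the standard `ast` module, each statement becomes a node, "control" edges link a statement to the next one in its block, and "data" edges link the most recent assignment of a variable to a statement that reads it. The names `build_code_graph`, `graph_to_prompt`, `header_parts`, and `EXAMPLE_SOURCE` are hypothetical, and the text layout produced by `graph_to_prompt` is only an illustration of a textual graph prompt, not the meta-graph template used in the paper.

```python
import ast

# Hypothetical example input; any single top-level function would do.
EXAMPLE_SOURCE = '''
def sum_positive(xs):
    total = 0
    for x in xs:
        if x > 0:
            total = total + x
    return total
'''


def header_parts(stmt):
    """Return the expressions evaluated by the statement itself, excluding its body."""
    if isinstance(stmt, (ast.If, ast.While)):
        return [stmt.test]
    if isinstance(stmt, ast.For):
        return [stmt.iter, stmt.target]
    return [stmt]


def build_code_graph(source):
    """Build a toy control/data-flow graph for the first function in `source`.

    Returns (nodes, edges): nodes is a list of statement-type labels indexed
    by node id, edges is a list of (src_id, dst_id, kind) tuples.
    Simplification (assumption): data flow ignores branching, so only the
    most recently visited assignment of each variable is tracked.
    """
    func = ast.parse(source).body[0]
    nodes = ["FunctionDef"]
    edges = []
    # Parameters count as defined at the function node (id 0).
    last_def = {a.arg: 0 for a in func.args.args}

    def add_edge(src, dst, kind):
        if (src, dst, kind) not in edges:
            edges.append((src, dst, kind))

    def walk_block(stmts, prev):
        for stmt in stmts:
            node_id = len(nodes)
            nodes.append(type(stmt).__name__)
            add_edge(prev, node_id, "control")
            names = [n for part in header_parts(stmt)
                     for n in ast.walk(part) if isinstance(n, ast.Name)]
            # Record reads before writes so `total = total + x` links to the old value.
            for n in names:
                if isinstance(n.ctx, ast.Load) and n.id in last_def:
                    add_edge(last_def[n.id], node_id, "data")
            for n in names:
                if isinstance(n.ctx, ast.Store):
                    last_def[n.id] = node_id
            if isinstance(stmt, (ast.If, ast.For, ast.While)):
                walk_block(stmt.body, node_id)
                walk_block(stmt.orelse, node_id)
            prev = node_id

    walk_block(func.body, 0)
    return nodes, edges


def graph_to_prompt(nodes, edges):
    """Serialize the graph as plain text that could be placed in an LLM prompt."""
    lines = ["Nodes:"]
    lines += [f"  {i}: {label}" for i, label in enumerate(nodes)]
    lines.append("Edges:")
    lines += [f"  {src} -[{kind}]-> {dst}" for src, dst, kind in edges]
    return "\n".join(lines)


if __name__ == "__main__":
    nodes, edges = build_code_graph(EXAMPLE_SOURCE)
    print(graph_to_prompt(nodes, edges))
```

A real pipeline would need proper control-flow analysis (branches, loops back-edges, early returns) and a parser for each target language; this sketch only shows how structural edges can be turned into text that a tuning-free model can read.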
