CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Abstract

Using large language models (LLMs) to generate code has shown promise for transforming software development. Despite the intelligence these models exhibit, their specificity in code generation can still be improved because of the syntactic gap and mismatched vocabulary between natural language (NL) and programming languages (PL). In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework that bridges the gap between NL and PL to enhance the performance of LLMs. CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to better capture programming domain knowledge, which helps natural-language-based LLMs understand code syntax and serves as a bridge among different programming languages. To bring the extracted structural knowledge into the foundation models, we propose 1) a hard meta-graph prompt template that transforms the challenging syntax graph into an informative graphical view for tuning-free models, and 2) a soft prompting technique that injects the domain knowledge of programming languages into model parameters by finetuning the models with soft signals encoded by a GNN expert model. Specifically, two constraints are designed to improve alignment and structure expressiveness, contributing to the informativeness of the single-token-sized external <GraphEmb> for enhanced code generation. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation. Implementation is available at https://anonymous.4open.science/r/Code-5970/.
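To make the idea of a graphical view concrete, the following is a minimal sketch, not the CodeGRAG implementation. It assumes a flat sequence of top-level Python statements, treats statement order as control flow, and treats "latest assignment to a variable, then a later read" as data flow. The function names `build_graph` and `render_prompt` and the prompt layout are hypothetical and are not taken from the paper.

```python
import ast


def build_graph(source: str):
    """Toy control/data-flow view of a flat sequence of statements.

    Assumption: only top-level statements are treated as nodes; branches
    and loops are not expanded, so this is far simpler than a real
    control-flow or data-flow extractor.
    """
    tree = ast.parse(source)
    nodes = []          # (node_id, statement type name)
    edges = set()       # (src_id, dst_id, kind)
    last_def = {}       # variable name -> id of the latest assigning node
    for i, stmt in enumerate(tree.body):
        nodes.append((i, type(stmt).__name__))
        if i > 0:
            edges.add((i - 1, i, "control"))
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads first: link each read to the latest earlier definition.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.add((last_def[n.id], i, "data:" + n.id))
        # Then record the variables this statement defines.
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = i
    return nodes, sorted(edges)


def render_prompt(task: str, nodes, edges) -> str:
    """Serialize the graph as plain text (hypothetical prompt layout)."""
    lines = ["Task: " + task, "Graph nodes:"]
    lines += [f"  n{i}: {kind}" for i, kind in nodes]
    lines.append("Graph edges:")
    lines += [f"  n{s} -> n{d} [{k}]" for s, d, k in edges]
    return "\n".join(lines)


if __name__ == "__main__":
    code = "a = 1\nb = a + 2\nc = a * b\n"
    nodes, edges = build_graph(code)
    print(render_prompt("Compute c from a and b", nodes, edges))
```

A text serialization like this is one way structural information could be placed in front of a tuning-free model. The soft-prompting variant described in the abstract instead encodes the graph with a GNN, which this sketch does not attempt.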
