Abstract
Pretrained language models have shown strong effectiveness in code-related tasks, such as code retrieval, code generation, code summarization, and code completion. In this paper, we propose COde assistaNt viA retrieval-augmeNted language model (CONAN), which aims to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding. Specifically, it consists of a code structure-aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G). CONAN-R pretrains CodeT5 using Code-Documentation Alignment and Masked Entity Prediction tasks to make language models code structure-aware and to learn effective representations for code snippets and documentation. CONAN-G then uses a dual-view code representation mechanism to implement a retrieval-augmented code generation model. CONAN-G regards the code documentation descriptions as prompts, which help language models better understand the code semantics. Our experiments show that CONAN achieves convincing performance on different code generation tasks and significantly outperforms previous retrieval-augmented code generation models. Our further analyses show that CONAN learns tailored representations for both code snippets and documentation by aligning code-documentation data pairs, and captures structural semantics by masking and predicting entities in the code data. Additionally, the retrieved code snippets and documentation provide necessary information from both programming language and natural language to assist the code generation process. CONAN can also be used as an assistant for Large Language Models (LLMs), providing LLMs with external knowledge in shorter code document lengths to improve their effectiveness on various code tasks. This shows the ability of CONAN to extract necessary information and help filter out the noise from retrieved code documents.