GraphCodeBERT: Pre-training Code Representations with Data Flow

Abstract

Pre-trained models for programming languages have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, and code summarization. However, existing pre-trained models regard a code snippet as a sequence of tokens and ignore the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code. Instead of using a syntactic-level structure of code such as the abstract syntax tree (AST), we use data flow in the pre-training stage, a semantic-level structure of code that encodes the "where-the-value-comes-from" relation between variables. Such a semantic-level structure is neat and avoids the unnecessarily deep hierarchy of an AST, which makes the model more efficient. We develop GraphCodeBERT based on the Transformer. In addition to the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model efficiently with a graph-guided masked attention function that incorporates the code structure. We evaluate our model on four tasks: code search, clone detection, code translation, and code refinement. Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attention over token-level attention in the task of code search.
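
To make the "where-the-value-comes-from" relation concrete, the sketch below extracts such edges from a few straight-line Python assignments. It is an illustration only, not the authors' extraction pipeline: the function name `data_flow_edges` and the (variable, line) node encoding are assumptions made for this example, and it handles only top-level assignments to simple names, with no branches, loops, or function calls.

```python
import ast


def data_flow_edges(source: str):
    """Hypothetical helper: return 'value comes from' edges for straight-line code.

    Each edge is ((src_var, src_line), (dst_var, dst_line)), meaning the value
    assigned to dst_var on dst_line is computed from src_var as last defined
    on src_line. Only top-level assignments to simple names are considered.
    """
    tree = ast.parse(source)
    last_def = {}  # variable name -> line of its most recent assignment
    edges = []
    for stmt in tree.body:
        if not isinstance(stmt, ast.Assign):
            continue
        # Variables read on the right-hand side, sorted for a stable order.
        reads = sorted({n.id for n in ast.walk(stmt.value)
                        if isinstance(n, ast.Name)})
        targets = [t.id for t in stmt.targets if isinstance(t, ast.Name)]
        for target in targets:
            for name in reads:
                if name in last_def:
                    edges.append(((name, last_def[name]),
                                  (target, stmt.lineno)))
        # Record the new definitions only after the reads are resolved,
        # so that `x = x + 1` points back to the previous definition of x.
        for target in targets:
            last_def[target] = stmt.lineno
    return edges


if __name__ == "__main__":
    snippet = "x = 1\ny = 2\nz = x + y\nx = z\n"
    for src, dst in data_flow_edges(snippet):
        print(f"{dst[0]} (line {dst[1]}) <- {src[0]} (line {src[1]})")
```

On the four-line snippet in the usage example, the edges form a flat graph over variable occurrences rather than a nested tree, which is the property the abstract contrasts with the deep hierarchy of an AST.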
