AstBERT: Enabling Language Model for Code Understanding with Abstract Syntax Tree

Abstract

Using a pre-trained language model (i.e. BERT) to apprehend source codes hasattracted increasing attention in the natural language processing community.However, there are several challenges when it comes to applying these languagemodels to solve programming language (PL) related problems directly, thesignificant one of which is the lack of domain knowledge issue thatsubstantially deteriorates the model's performance. To this end, we propose theAstBERT model, a pre-trained language model aiming to better understand the PLusing the abstract syntax tree (AST). Specifically, we collect a colossalamount of source codes (both java and python) from GitHub and incorporate thecontextual code knowledge into our model through the help of code parsers, inwhich AST information of the source codes can be interpreted and integrated. Weverify the performance of the proposed model on code information extraction andcode search tasks, respectively. Experiment results show that our AstBERT modelachieves state-of-the-art performance on both downstream tasks (with 96.4% forcode information extraction task, and 57.12% for code search task).

Quick Read (beta)

loading the full paper ...