Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Abstract

Statistical language modeling techniques have been successfully applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and for improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Traditional language models, however, limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. Yet open-vocabulary versions of neural network language models for code have not been introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best-in-class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new test project, resulting in increased performance. We showcase our methodology on code corpora in three different languages of over a billion tokens each, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported.
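
To make the subword segmentation concrete, the following is a minimal sketch of a compression-style merge procedure in the spirit of byte-pair encoding, the usual subword method from machine translation. It is an illustration under stated assumptions, not the authors' implementation: the function names `learn_bpe`, `merge_pair`, and `segment`, the end-of-token marker `</t>`, and the toy identifiers are all hypothetical.

```python
from collections import Counter

END = "</t>"  # hypothetical end-of-token marker, not taken from the paper


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(tokens, num_merges):
    """Learn up to `num_merges` merges by repeatedly joining the most frequent adjacent pair."""
    # Each distinct token starts as a sequence of characters plus the end marker.
    vocab = Counter(tuple(tok) + (END,) for tok in tokens)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every token is already a single symbol
        # Break frequency ties by the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def segment(token, merges):
    """Split a possibly unseen token into subword units by replaying the learned merges."""
    symbols = tuple(token) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy training data: identifiers that share subwords.
corpus = ["getName", "getValue", "setName", "setValue", "getName"]
merges = learn_bpe(corpus, num_merges=10)
units = segment("resetName", merges)  # identifier never seen during training
```

Because segmentation falls back to single characters when no merge applies, an identifier that never appeared in training still maps to a sequence of known units, which is what lets a model over these units avoid a fixed identifier vocabulary.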
