Abstract
Statistical language modeling techniques have been successfully applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Traditional language models, however, limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. Yet an open-vocabulary version of neural network language models for code has not been introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best-in-class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new test project, resulting in increased performance. We showcase our methodology on code corpora in three different languages of over a billion tokens each, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported.
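To make the idea of compression-based subword segmentation concrete, the following is a minimal sketch of byte-pair encoding (BPE), a segmentation scheme widely used in machine translation. It is an illustration under stated assumptions, not the implementation behind the reported results: the abstract does not name the algorithm, and the function names `learn_bpe`, `merge_pair`, and `segment`, the end-of-token marker `</t>`, and the toy identifier counts are all hypothetical.

```python
from collections import Counter

END = "</t>"  # hypothetical end-of-token marker (an assumption of this sketch)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(token_counts, num_merges):
    """Learn up to `num_merges` merge operations from a token -> frequency map."""
    # Each token starts as a sequence of single characters plus the end marker.
    vocab = {tuple(tok) + (END,): n for tok, n in token_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by token frequency.
        pairs = Counter()
        for symbols, n in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break
        # Merge the most frequent pair; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): n for symbols, n in vocab.items()}
    return merges


def segment(token, merges):
    """Split a possibly unseen token into subword units by replaying the merges in order."""
    symbols = tuple(token) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy identifier frequencies, invented purely for illustration.
counts = {"getName": 10, "getValue": 8, "setName": 6, "setValue": 4}
merges = learn_bpe(counts, num_merges=12)
print(segment("getNameValue", merges))  # an identifier never seen during learning
```

Because every token can always fall back to single characters, an identifier that never appeared in the training data is still representable as a sequence of known subword units, which is what makes the vocabulary open.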