Exploring Software Naturalness through Neural Language Models

Abstract

The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST), while our transformer-based language models work on raw source code. This work is the first to investigate whether such language models can discover AST features automatically. To achieve this, we introduce a sequence labeling task that directly probes the language model's understanding of the AST. Our results show that transformer-based language models achieve high accuracy in the AST tagging task. Furthermore, we evaluate our model on a software vulnerability identification task. Importantly, we show that our approach obtains vulnerability identification results comparable to graph-based approaches that rely heavily on compilers for feature extraction.
