DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

Abstract

Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 13% in unsupervised code translation, and 24% in natural language code search. Incidentally, we found that our pre-trained model is able to de-obfuscate fully obfuscated source files, and to suggest descriptive variable names.
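
To make the idea of recovering obfuscated source code concrete, the following is a minimal sketch of how one training pair could be built from a Python snippet. It is an illustration under stated assumptions, not the paper's pipeline: the `Obfuscator` class, the `obfuscate` helper, the `SAMPLE` snippet, and the `FUNC_i` / `VAR_i` placeholder scheme are all hypothetical names chosen for this example. The obfuscated text would be the model input, and the name mapping is what the model would have to recover.

```python
import ast

# Hypothetical sample input; any small function would do.
SAMPLE = """
def add_all(values):
    total = 0
    for v in values:
        total = total + v
    return total
"""


class Obfuscator(ast.NodeTransformer):
    """Rename user-defined functions and variables to placeholders.

    Simplification (assumption of this sketch): a name is renamed on use
    only if it was already defined earlier in traversal order, so builtins
    and undefined names are left untouched.
    """

    def __init__(self):
        self.mapping = {}                      # original name -> placeholder
        self.counts = {"FUNC": 0, "VAR": 0}    # next index per placeholder kind

    def _placeholder(self, name, kind):
        if name not in self.mapping:
            self.mapping[name] = f"{kind}_{self.counts[kind]}"
            self.counts[kind] += 1
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name, "FUNC")
        self.generic_visit(node)               # then rename arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg, "VAR")
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):    # assignment or loop target
            node.id = self._placeholder(node.id, "VAR")
        elif node.id in self.mapping:          # use of an already renamed name
            node.id = self.mapping[node.id]
        return node


def obfuscate(source):
    """Return (obfuscated_source, mapping) for a Python source string."""
    obfuscator = Obfuscator()
    tree = obfuscator.visit(ast.parse(source))
    return ast.unparse(tree), obfuscator.mapping  # ast.unparse needs Python 3.9+


if __name__ == "__main__":
    obfuscated, mapping = obfuscate(SAMPLE)
    print(obfuscated)   # model input in this sketch
    print(mapping)      # names the model would be asked to recover
```

In this sketch the obfuscated function behaves exactly like the original, so the only information removed is the identifier names. Inferring them back requires understanding what the code does, which is the intuition behind using de-obfuscation as a pre-training signal.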
