DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

  • 2021-02-16 20:42:12
  • Baptiste Roziere, Marie-Anne Lachaux, Marc Szafraniec, Guillaume Lample

Abstract

Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 13% in unsupervised code translation, and 24% in natural language code search. Incidentally, we found that our pre-trained model is able to de-obfuscate fully obfuscated source files, and to suggest descriptive variable names.
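
To make the objective concrete, the following minimal sketch builds one (obfuscated, original) pair of the kind a model could be trained to invert. It is an illustration only, not the authors' pipeline: the helper names `obfuscate` and `restore` and the placeholder scheme `FUNC_n` / `VAR_n` are assumptions that the abstract does not specify, and the regex-based renaming would also rewrite matching words inside strings and comments.

```python
import ast
import re


def _rename(source: str, mapping: dict) -> str:
    """Replace whole-word occurrences of each key with its value in one pass."""
    if not mapping:
        return source
    pattern = r"\b(" + "|".join(re.escape(name) for name in mapping) + r")\b"
    return re.sub(pattern, lambda m: mapping[m.group(0)], source)


def obfuscate(source: str):
    """Return (obfuscated_source, mapping) for user-defined names.

    Hypothetical scheme: function names become FUNC_n, while arguments
    and assigned variables become VAR_n.
    """
    mapping = {}
    counters = {"FUNC": 0, "VAR": 0}

    def assign(name: str, kind: str) -> None:
        if name not in mapping:
            mapping[name] = f"{kind}_{counters[kind]}"
            counters[kind] += 1

    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            assign(node.name, "FUNC")
        elif isinstance(node, ast.arg):
            assign(node.arg, "VAR")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            assign(node.id, "VAR")

    return _rename(source, mapping), mapping


def restore(obfuscated: str, mapping: dict) -> str:
    """Invert the renaming, given the original-to-placeholder mapping."""
    inverse = {placeholder: name for name, placeholder in mapping.items()}
    return _rename(obfuscated, inverse)


if __name__ == "__main__":
    original = "def add(a, b):\n    total = a + b\n    return total\n"
    obfuscated, mapping = obfuscate(original)
    print(obfuscated)  # model input in this sketch
    print(mapping)     # the names a model would have to recover
```

In this sketch the obfuscated text plays the role of the model input and the name mapping plays the role of the target; `restore` only verifies that the mapping is enough to recover the original source.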

 
