Unsupervised Translation of Programming Languages

Abstract

A transcompiler, also known as a source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.
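
The idea of checking a translation with unit tests, rather than by textual similarity to a reference, can be illustrated with a minimal sketch. Everything below is hypothetical and not taken from the released test set: the helper `passes_unit_tests`, the toy functions, and the inputs are assumptions made for illustration. In a real setting the reference and the translation would be written in different languages and executed separately; here both are Python callables to keep the example self-contained.

```python
from typing import Any, Callable, Sequence, Tuple


def passes_unit_tests(
    reference: Callable[..., Any],
    candidate: Callable[..., Any],
    test_inputs: Sequence[Tuple[Any, ...]],
) -> bool:
    """Return True if `candidate` matches `reference` on every test input.

    Hypothetical helper: a translation counts as correct only when it
    produces the same output as the reference for all inputs.
    """
    for args in test_inputs:
        expected = reference(*args)
        try:
            actual = candidate(*args)
        except Exception:
            # A translation that crashes on a test input is treated as incorrect.
            return False
        if actual != expected:
            return False
    return True


# Toy "reference" function, standing in for the source-language implementation.
def reference_max(a: int, b: int) -> int:
    return a if a > b else b


# Two toy "translations": one behaves identically, one does not.
def good_translation(a: int, b: int) -> int:
    return max(a, b)


def bad_translation(a: int, b: int) -> int:
    return min(a, b)


TEST_INPUTS = [(1, 2), (5, 3), (-1, -1)]

if __name__ == "__main__":
    print(passes_unit_tests(reference_max, good_translation, TEST_INPUTS))
    print(passes_unit_tests(reference_max, bad_translation, TEST_INPUTS))
```

A check of this kind accepts a translation that differs textually from the reference as long as it behaves the same on the test inputs, and rejects one that looks similar but computes something else.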
