Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

Abstract

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available. Low-resource languages include OCaml, Racket, and several others.

This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, MultiPL-T, translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize tests for commented code from a high-resource language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate Python code to a target low-resource language, and use tests to validate the translation. We apply this approach to generate tens of thousands of validated training items for Julia, Lua, OCaml, R, and Racket. Furthermore, we use an open model (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done.

With MultiPL-T generated data, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket. On established benchmarks (MultiPL-E), these models outperform other open Code LLMs. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
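The two steps above can be read as a filter-then-validate pipeline. The following is a minimal sketch of that control flow only, not the paper's implementation. Every name in it (`build_training_items`, the callable parameters, and the `min_coverage` threshold of 0.9) is a hypothetical placeholder, and it assumes that tests are carried over to the target language by some `translate_test` step, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class TrainingItem:
    code: str         # translation that passed every carried-over test
    tests: List[str]  # the target-language tests it passed


def build_training_items(
    sources: Iterable[str],
    generate_tests: Callable[[str], List[str]],   # hypothetical: LLM proposes tests
    passes: Callable[[str, str], bool],           # hypothetical: run test on source
    coverage: Callable[[str, List[str]], float],  # hypothetical: coverage in [0, 1]
    translate: Callable[[str], List[str]],        # hypothetical: LLM candidates
    translate_test: Callable[[str], str],         # assumption: tests are ported too
    passes_target: Callable[[str, str], bool],    # hypothetical: run in target lang
    min_coverage: float = 0.9,                    # illustrative threshold only
) -> List[TrainingItem]:
    items: List[TrainingItem] = []
    for src in sources:
        # Step 1: keep only generated tests that the source code passes,
        # then drop sources whose surviving tests cover too little code.
        tests = [t for t in generate_tests(src) if passes(src, t)]
        if not tests or coverage(src, tests) < min_coverage:
            continue

        # Step 2: accept the first candidate translation that passes
        # every test in the target language.
        target_tests = [translate_test(t) for t in tests]
        for candidate in translate(src):
            if all(passes_target(candidate, t) for t in target_tests):
                items.append(TrainingItem(candidate, target_tests))
                break
    return items
```

Passing the model calls and test runners in as plain callables keeps the sketch independent of any particular model or sandbox; in practice those pieces would wrap a Code LLM and a per-language execution harness.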
