Lyra: A Benchmark for Turducken-Style Code Generation

Abstract

Recently, neural techniques have been used to generate source codeautomatically. While promising for declarative languages, these approachesachieve much poorer performance on datasets for imperative languages. Since adeclarative language is typically embedded in an imperative language (i.e., theturducken-style programming) in real-world software development, the promisingresults on declarative languages can hardly lead to significant reduction ofmanual software development efforts. In this paper, we define a new codegeneration task: given a natural language comment, this task aims to generate aprogram in a base imperative language with an embedded declarative language. Toour knowledge, this is the first turducken-style code generation task. For thistask, we present Lyra: a dataset in Python with embedded SQL. This datasetcontains 2,000 carefully annotated database manipulation programs fromreal-world projects. Each program is paired with both a Chinese comment and anEnglish comment. In our experiment, we adopted Transformer, BERT-style, andGPT-style models as baselines. In the best setting, the generation performanceof GPT-style models is better than others, where the AST exact matchingaccuracy is 24% and 25.5% when using Chinese and English comments,respectively. Therefore, we believe that Lyra provides a new challenge for codegeneration. Yet, overcoming this challenge may significantly boost theapplicability of code generation techniques for real-world softwaredevelopment.

Quick Read (beta)

loading the full paper ...