Synthetic Programming Elicitation and Repair for Text-to-Code in Very Low-Resource Programming Languages

Abstract

Recent advances in large language models (LLMs) for code applications have demonstrated remarkable zero-shot fluency and instruction following on challenging code-related tasks ranging from test case generation to self-repair. Unsurprisingly, however, models struggle to compose syntactically valid programs in programming languages unrepresented in pre-training, referred to as very low-resource Programming Languages (VLPLs). VLPLs appear in crucial settings, including domain-specific languages for internal tools and tool-chains for legacy languages. Inspired by an HCI technique called natural program elicitation, we propose designing an intermediate language that LLMs "naturally" know how to use and which can be automatically compiled to a target VLPL. When LLMs generate code that lies outside of this intermediate language, we use compiler techniques to repair the code into programs in the intermediate language. Overall, we introduce synthetic programming elicitation and compilation (SPEAC), an approach that enables LLMs to generate syntactically valid code even for VLPLs. We empirically evaluate the performance of SPEAC in a case study and find that, compared to existing retrieval and fine-tuning baselines, SPEAC produces syntactically correct programs significantly more frequently without sacrificing semantic correctness.
