Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages

Abstract

Recent advances in large language models (LLMs) for code applications have demonstrated remarkable zero-shot fluency and instruction following on challenging code-related tasks ranging from test case generation to self-repair. Unsurprisingly, however, models struggle to compose syntactically valid programs in programming languages unrepresented in pre-training, referred to as very low-resource Programming Languages (VLPLs). VLPLs appear in crucial settings, including domain-specific languages for internal tools, toolchains for legacy languages, and formal verification frameworks. Inspired by a technique called natural programming elicitation, we propose designing an intermediate language that LLMs "naturally" know how to use and which can be automatically compiled to a target VLPL. When LLMs generate code that lies outside of this intermediate language, we use compiler techniques to repair the code into programs in the intermediate language. Overall, we introduce synthetic programming elicitation and compilation (SPEAC), an approach that enables LLMs to generate syntactically valid code even for VLPLs. We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language and find that, compared to existing retrieval and fine-tuning baselines, SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness.
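
The generate, repair, compile flow described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the intermediate language is taken to be a tiny Python subset (assignments and assertions only), the "repair" step simply drops statements outside that subset as a crude stand-in for the compiler techniques mentioned in the abstract, and the target syntax (`let`, `check`) is invented and is not UCLID5.

```python
import ast

# Hypothetical intermediate language: a Python subset that allows only
# assignments (`x = 1`) and assertions (`assert x > 0`) at the top level.
ALLOWED = (ast.Assign, ast.Assert)


def repair(source: str) -> ast.Module:
    """Drop top-level statements outside the subset.

    This is a deliberately crude stand-in for repair, not the paper's method.
    """
    tree = ast.parse(source)
    tree.body = [stmt for stmt in tree.body if isinstance(stmt, ALLOWED)]
    return tree


def compile_to_target(tree: ast.Module) -> str:
    """Emit an invented target syntax (NOT real UCLID5)."""
    lines = []
    for stmt in tree.body:
        if isinstance(stmt, ast.Assign):
            name = ast.unparse(stmt.targets[0])
            lines.append(f"let {name} := {ast.unparse(stmt.value)};")
        else:  # ast.Assert
            lines.append(f"check {ast.unparse(stmt.test)};")
    return "\n".join(lines)


# Simulated LLM output that strays outside the intermediate language.
llm_output = "import os\nx = 1\nprint(x)\nassert x > 0\n"
target_program = compile_to_target(repair(llm_output))
print(target_program)
```

The point of the sketch is the division of labor: the model writes in a language it already knows, and deterministic tooling is responsible for bringing that output into the subset and translating it to the low-resource target.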
