Abstract
We present PanGu-Coder, a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation, i.e. the synthesis of programming language solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw programming language data, while the second stage uses a combination of Causal Language Modelling and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation and train on loosely curated pairs of natural language program definitions and code functions. Finally, we discuss PanGu-Coder-FT, which is fine-tuned on a combination of competitive programming problems and code with continuous integration tests. We evaluate PanGu-Coder with a focus on whether it generates functionally correct programs and demonstrate that it achieves equivalent or better performance than similarly sized models, such as CodeX, while attending to a smaller context window and training on less data.