CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Abstract

Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications, while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Second, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.
