GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding

Abstract

Language models can serve as a valuable tool for software developers to increase productivity. Large generative models can be used for code generation and code completion, while smaller encoder-only models are capable of performing code search tasks using natural language queries. These capabilities are heavily influenced by the quality and diversity of the available training data. Source code datasets used for training usually focus on the most popular languages, and testing is mostly conducted on the same distributions, often overlooking low-resource programming languages. Motivated by the NLP generalization taxonomy proposed by Hupkes et al., we propose a new benchmark dataset called GenCodeSearchNet (GeCS), which builds upon existing natural language code search datasets to systematically evaluate the programming language understanding generalization capabilities of language models. As part of the full dataset, we introduce a new, manually curated subset, StatCodeSearch, that focuses on R, a popular but so far underrepresented programming language that is often used by researchers outside the field of computer science. For evaluation and comparison, we collect several baseline results using fine-tuned BERT-style models and GPT-style large language models in a zero-shot setting.
