CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Abstract

Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from the CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task. We hope that the CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further, and we will host a competition and leaderboard to track progress on the challenge. We are also keen on extending the CodeSearchNet Challenge to more queries and programming languages in the future.
