Abstract
Large language models are becoming an increasingly popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle to generate code for complex programs. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code. To evaluate language models, we create a benchmark, ParEval, consisting of prompts that represent 420 different coding tasks related to scientific and parallel computing. We use ParEval to evaluate the effectiveness of several state-of-the-art open- and closed-source language models on these tasks. We introduce novel metrics for evaluating the performance of generated code, and use them to explore how well each large language model performs for 12 different computational problem types and six different parallel programming models.