MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks

Abstract

With easier access to powerful compute resources, there is a growing trend in AI for software development to develop large language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because LLMs for HPC tasks are obtained by finetuning existing LLMs that support several natural and/or programming languages. We found this design choice confusing: why do we need LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question choices made by existing LLMs by developing smaller language models (LMs) for specific domains, which we call domain-specific LMs. We start with HPC as a domain and build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LMs but delivers better performance on non-HPC and HPC codes. Specifically, we pre-trained MonoCoder on an HPC-specific dataset (named HPCorpus) of C and C++ programs mined from GitHub. We evaluated the performance of MonoCoder against state-of-the-art multi-lingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LMs, outperforms other LLMs on normalized-perplexity tests (in relation to model size) while also delivering competitive CodeBLEU scores for high-performance and parallel code generation. In other words, results suggest that MonoCoder understands HPC code better than state-of-the-art LLMs.
