Investigating the Performance of Language Models for Completing Code in Functional Programming Languages: a Haskell Case Study

Abstract

Language model-based code completion models have quickly grown in use, helping thousands of developers write code in many different programming languages. However, research on code completion models typically focuses on imperative languages such as Python and JavaScript, which results in a lack of representation for functional programming languages. Consequently, these models often perform poorly on functional languages such as Haskell. To investigate whether this can be alleviated, we evaluate the performance of two language models for code, CodeGPT and UniXcoder, on the functional programming language Haskell. We fine-tune and evaluate the models on Haskell functions sourced from a publicly accessible Haskell dataset on HuggingFace. Additionally, we manually evaluate the models using our novel translated HumanEval dataset. Our automatic evaluation shows that knowledge of imperative programming languages in the pre-training of LLMs may not transfer well to functional languages, but that code completion on functional languages is feasible. Consequently, this shows the need for more high-quality Haskell datasets. A manual evaluation on HumanEval-Haskell indicates CodeGPT frequently generates empty predictions and extra comments, while UniXcoder more often produces incomplete or incorrect predictions. Finally, we release HumanEval-Haskell, along with the fine-tuned models and all code required to reproduce our experiments on GitHub (https://github.com/AISE-TUDelft/HaskellCCEval).
