Large Language Models for cross-language code clone detection

Abstract

With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five LLMs and eight prompts for the identification of cross-lingual code clones. Additionally, we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. The studies involving LLMs and embedding models are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.99, for straightforward programming examples. However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of "code clones" in a cross-lingual setting. We show that embedding models used to represent code fragments from different programming languages in the same representation space enable the training of a basic classifier that outperforms all LLMs by approximately 1 and 20 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection.
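
The embedding-plus-classifier idea described above can be illustrated with a minimal sketch. It is not the paper's pipeline: the abstract does not name the embedding model, the pair features, or the classifier, so everything here is an assumption. The embeddings are synthetic stand-ins (clone pairs are noisy copies of one vector, non-clone pairs are independent vectors), the features `cosine` and mean absolute difference are illustrative choices, and the "basic classifier" is assumed to be a logistic regression trained by stochastic gradient descent. All function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_features(u, v):
    """Assumed pair features: cosine similarity and mean absolute difference."""
    mean_abs_diff = sum(abs(a - b) for a, b in zip(u, v)) / len(u)
    return [cosine(u, v), mean_abs_diff]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.5, epochs=200):
    """Logistic regression fitted with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    """Return 1 (clone) when the linear score is non-negative, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0


def f1_score(y_true, y_pred):
    """F1 for the positive (clone) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def synthetic_pairs(n, dim, rng):
    """Synthetic stand-in for embedded code pairs (NOT real model output)."""
    X, y = [], []
    for i in range(n):
        u = [rng.gauss(0, 1) for _ in range(dim)]
        if i % 2 == 0:
            # "Clone": the second vector is a slightly perturbed copy.
            v, label = [a + rng.gauss(0, 0.1) for a in u], 1
        else:
            # "Non-clone": the second vector is drawn independently.
            v, label = [rng.gauss(0, 1) for _ in range(dim)], 0
        X.append(pair_features(u, v))
        y.append(label)
    return X, y


def run_demo(seed=0):
    """Train on synthetic pairs and return F1 on a held-out synthetic set."""
    rng = random.Random(seed)
    X_train, y_train = synthetic_pairs(200, 16, rng)
    X_test, y_test = synthetic_pairs(100, 16, rng)
    w, b = train_logreg(X_train, y_train)
    return f1_score(y_test, [predict(w, b, x) for x in X_test])
```

Calling `run_demo()` returns the F1 score on the held-out synthetic pairs. Because the synthetic data is far easier to separate than real cross-lingual code, that score says nothing about XLCoST or CodeNet. With a real embedding model, only `synthetic_pairs` would need to be replaced by a function that embeds both code fragments of each labeled pair.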
