Abstract
Recent advancements in natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine translation and summarization, their applicability in code intelligence tasks remains limited without human involvement. The complexity of programming concepts required for such tasks makes it difficult to develop evaluation metrics that align with human judgment. Token-matching-based metrics, such as BLEU, have demonstrated weak correlations with the judgments of human practitioners in code intelligence tasks. Moreover, utilizing human-written test suites to evaluate functional correctness can be challenging in low-resource domains. To overcome these obstacles, we propose \texttt{ICE-Score}, a new evaluation metric that instructs large language models (LLMs) to perform code assessments. Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences, without the need for test oracles or references. We evaluate the efficacy of our metric on two different aspects (\textit{human preference} and \textit{execution success}) and four programming languages. Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation, delivering high levels of accuracy and consistency across various programming languages and tasks. We also make our evaluation metric and datasets available to the public\footnote{\url{https://github.com/terryyz/ice-score}}, encouraging further research in evaluating code intelligence tasks.