SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Abstract

Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable advancements in their ability to generate fitting responses to natural language instructions. However, many current works rely on manual evaluation to judge the quality of generated responses. Since such manual evaluation is time-consuming, it does not easily scale to the evaluation of multiple models and model variants. In this short paper, we propose a straightforward but remarkably effective evaluation metric called SemScore, in which we directly compare model outputs to gold target responses using semantic textual similarity (STS). We conduct a comparative evaluation of the model outputs of 12 prominent instruction-tuned LLMs using 8 widely-used evaluation metrics for text generation. We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation to human evaluation. These findings indicate the utility of our proposed metric for the evaluation of instruction-tuned LLMs.
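
The comparison of model outputs to gold target responses described above can be illustrated with a minimal sketch. The abstract does not specify how the similarity is computed, so everything concrete here is an assumption: the use of cosine similarity, the averaging over examples, and the names `toy_embed`, `cosine_similarity`, and `sts_score` are hypothetical. In particular, `toy_embed` is a bag-of-words placeholder so the example runs without external dependencies; a real STS setup would substitute a trained sentence-embedding model.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a sentence encoder: lowercase word counts.

    This is NOT a semantic embedding; it only makes the sketch runnable.
    """
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sts_score(
    outputs: List[str],
    targets: List[str],
    embed: Callable[[str], Vector] = toy_embed,
) -> float:
    """Average similarity between each model output and its gold target.

    Assumption: one score per (output, target) pair, averaged over the set.
    """
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    if not outputs:
        raise ValueError("at least one (output, target) pair is required")
    similarities = [
        cosine_similarity(embed(output), embed(target))
        for output, target in zip(outputs, targets)
    ]
    return sum(similarities) / len(similarities)


if __name__ == "__main__":
    model_outputs = ["Paris is the capital of France", "I do not know"]
    gold_targets = ["The capital of France is Paris", "The answer is 42"]
    print(f"score: {sts_score(model_outputs, gold_targets):.3f}")
```

Passing the embedding function as a parameter keeps the scoring logic independent of the encoder, so the placeholder can be swapped for an actual sentence-embedding model without touching `sts_score`.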
