Benchmarking LLMs for Translating Classical Chinese Poetry: Evaluating Adequacy, Fluency, and Elegance

Abstract

Large language models (LLMs) have shown remarkable performance in translation tasks. However, there is an increasing demand for high-quality translations that are not only adequate but also fluent and elegant. To evaluate the extent to which current LLMs can meet these demands, we introduce a suitable benchmark (PoetMT) for translating classical Chinese poetry into English. This task requires not only adequacy in translating culturally and historically significant content but also strict adherence to linguistic fluency and poetic elegance. To overcome the limitations of traditional evaluation metrics, we propose an automatic evaluation metric based on GPT-4, which better evaluates translation quality in terms of adequacy, fluency, and elegance. Our evaluation study reveals that existing large language models fall short in this task. To address these issues, we propose RAT, a Retrieval-Augmented machine Translation method that enhances the translation process by incorporating knowledge related to classical poetry. Our dataset and code will be made available.
