cushLEPOR: customising hLEPOR metric using Optuna for higher agreement with human judgments or pre-trained language model LaBSE

Abstract

Human evaluation has always been expensive, while researchers struggle to trust automatic metrics. To address this, we propose to customise traditional metrics by taking advantage of pre-trained language models (PLMs) and the limited human-labelled scores available. We first re-introduce the hLEPOR metric factors, followed by the Python version we developed (ported), which achieves automatic tuning of the weighting parameters in the hLEPOR metric. We then present the customised hLEPOR (cushLEPOR), which uses the Optuna hyper-parameter optimisation framework to fine-tune the hLEPOR weighting parameters towards better agreement with pre-trained language models (using LaBSE) on the exact MT language pairs that cushLEPOR is deployed to. We also optimise cushLEPOR towards professional human evaluation data based on the MQM and pSQM frameworks on the English-German and Chinese-English language pairs. The experimental investigations show that cushLEPOR boosts hLEPOR performance towards better agreement with PLMs like LaBSE at much lower cost, and towards better agreement with human evaluations including MQM and pSQM scores, and that it yields much better performance than BLEU (data available at \url{https://github.com/poethan/cushLEPOR}). Official results show that our submissions win three language pairs, including \textbf{English-German} and \textbf{Chinese-English} on the \textit{News} domain via cushLEPOR(LM) and \textbf{English-Russian} on the \textit{TED} domain via hLEPOR.
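The tuning idea described in the abstract can be illustrated with a minimal sketch. It is not the released cushLEPOR code. It assumes that hLEPOR combines three factor scores (length penalty LP, position difference penalty NPosPenal, and the harmonic precision/recall term HPR) as a weighted harmonic mean, and it tunes only those three weights. The per-segment factor values, the target scores (standing in for LaBSE similarities or human scores), the search range, and the names `w_lp`, `w_npp`, `w_hpr`, `agreement`, and `tune` are all hypothetical. If Optuna is not installed, the sketch falls back to a plain random search, which is a stand-in and not what the paper uses.

```python
import math
import random


def weighted_harmonic_mean(factors, weights):
    """Return sum(w) / sum(w / f); 0.0 if any factor is non-positive."""
    if any(f <= 0 for f in factors):
        return 0.0
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical toy data: (LP, NPosPenal, HPR) per segment, plus target scores
# that play the role of LaBSE similarities or human judgments.
SEGMENTS = [
    (0.95, 0.80, 0.70),
    (0.60, 0.90, 0.40),
    (0.99, 0.50, 0.85),
    (0.70, 0.75, 0.30),
    (0.85, 0.95, 0.90),
]
TARGETS = [0.72, 0.45, 0.80, 0.35, 0.91]
BASELINE = (1.0, 1.0, 1.0)  # assumed untuned starting weights


def agreement(weights):
    """Correlation between the weighted metric scores and the target scores."""
    scores = [weighted_harmonic_mean(seg, weights) for seg in SEGMENTS]
    return pearson(scores, TARGETS)


def tune(n_trials=50, seed=0):
    """Search for weights that maximise agreement with the targets."""
    try:
        import optuna
    except ImportError:
        # Fallback (assumption): random search, starting from the baseline.
        rng = random.Random(seed)
        best_w, best = BASELINE, agreement(BASELINE)
        for _ in range(n_trials):
            w = tuple(rng.uniform(0.1, 10.0) for _ in range(3))
            score = agreement(w)
            if score > best:
                best_w, best = w, score
        return best_w, best

    optuna.logging.set_verbosity(optuna.logging.WARNING)
    names = ("w_lp", "w_npp", "w_hpr")

    def objective(trial):
        weights = tuple(trial.suggest_float(name, 0.1, 10.0) for name in names)
        return agreement(weights)

    study = optuna.create_study(
        direction="maximize", sampler=optuna.samplers.TPESampler(seed=seed)
    )
    # Evaluate the baseline first so the result is never worse than it.
    study.enqueue_trial(dict(zip(names, BASELINE)))
    study.optimize(objective, n_trials=n_trials)
    params = study.best_params
    return tuple(params[name] for name in names), study.best_value


best_weights, best_corr = tune()
print("baseline agreement:", agreement(BASELINE))
print("tuned weights:", best_weights, "agreement:", best_corr)
```

Because the baseline weights are always evaluated first, the tuned agreement in this sketch cannot fall below the untuned one. With real data the objective would be computed on held-out segments for the specific language pair, rather than on five toy points.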
