Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy

Abstract

Selecting an automatic metric that best emulates human annotators is often non-trivial, because there is no clear definition of "best emulates." A meta-metric is required to compare the human judgments to the automatic metric scores, and metric rankings depend on the choice of meta-metric. We propose Soft Pairwise Accuracy (SPA), a new meta-metric that builds on Pairwise Accuracy (PA) but incorporates the statistical significance of both the human judgments and the metric scores. We show that SPA is more stable than PA with respect to changes in the number of systems/segments used for evaluation. We also show that PA can only assign a small set of distinct output values to metrics, and this results in many metrics being artificially assigned the exact same PA score. We demonstrate that SPA fixes this issue. Finally, we show that SPA is more discriminative than PA, producing more statistically significant comparisons between metrics. SPA was selected as the official system-level metric for the 2024 WMT Metrics Shared Task.
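
To make the contrast between PA and SPA concrete, the following is a minimal Python sketch, not the authors' implementation. The PA function counts the system pairs that a metric orders the same way as the human scores, so with N systems it can take at most N(N-1)/2 + 1 distinct values. The soft variant assumes that SPA averages 1 - |p_human - p_metric| over system pairs, where each p-value comes from a significance test on that pair; the abstract does not spell out this form, so it is an assumption, as are all function names and the toy scores.

```python
from itertools import combinations


def pairwise_accuracy(human, metric):
    """Fraction of system pairs that the metric orders the same way as humans."""
    pairs = list(combinations(range(len(human)), 2))
    agree = sum(
        (human[i] > human[j]) == (metric[i] > metric[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def soft_pairwise_accuracy(p_human, p_metric):
    """Assumed form of SPA: mean of 1 - |p_human - p_metric| over system pairs.

    Both arguments map a system pair (i, j) to a p-value in [0, 1].
    How the p-values are obtained is outside the scope of this sketch.
    """
    if p_human.keys() != p_metric.keys():
        raise ValueError("both inputs must cover the same system pairs")
    total = sum(1 - abs(p_human[pair] - p_metric[pair]) for pair in p_human)
    return total / len(p_human)


# Toy system-level scores (hypothetical): four systems, two different metrics.
human = [0.9, 0.7, 0.4, 0.1]
metric_a = [0.8, 0.75, 0.3, 0.35]
metric_b = [0.95, 0.6, 0.2, 0.5]

# Each metric misorders only the last pair, so PA cannot tell them apart.
pa_a = pairwise_accuracy(human, metric_a)
pa_b = pairwise_accuracy(human, metric_b)

# Toy p-values (hypothetical) for two system pairs.
spa = soft_pairwise_accuracy(
    {(0, 1): 0.01, (0, 2): 0.5},
    {(0, 1): 0.03, (0, 2): 0.9},
)
```

In this toy case both metrics receive the same PA even though their scores differ, which is the kind of tie the abstract describes. Under the assumed soft form, the score varies continuously with the p-values, so exact ties become much less likely.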
