RankME: Reliable Human Ratings for Natural Language Generation

  • 2018-03-15 18:10:45
  • Jekaterina Novikova, Ondřej Dušek, Verena Rieser

Abstract

Human evaluation for natural language generation (NLG) often suffers from inconsistent user ratings. While previous research tends to attribute this problem to individual user preferences, we show that the quality of human judgements can also be improved by experimental design. We present a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments. We show that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods. In addition, we show that it is possible to evaluate NLG systems according to multiple, distinct criteria, which is important for error analysis. Finally, we demonstrate that RankME, in combination with Bayesian estimation of system quality, is a cost-effective alternative for ranking multiple NLG systems.

 
