On conducting better validation studies of automatic metrics in natural language generation evaluation

Abstract

Natural language generation (NLG) has received increasing attention, which has highlighted evaluation as a central methodological concern. Since human evaluations for these systems are costly, automatic metrics have broad appeal in NLG. Research in language generation often finds situations where it is appropriate to apply existing metrics or propose new ones. The application of these metrics is entirely dependent on validation studies: studies that determine a metric's correlation to human judgment. However, there are many details and considerations in conducting strong validation studies. This document is intended for those validating existing metrics or proposing new ones in the broad context of NLG: we 1) begin with a write-up of best practices in validation studies, 2) outline how to adopt these practices, 3) conduct analyses in the WMT'17 metrics shared task\footnote{Our Jupyter notebook containing the analyses is available at \url{https://github.com}}, 4) highlight promising approaches to NLG metrics, and 5) conclude with our opinions on the future of this area.
