Abstract
In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with humans. However, we identify some limitations in traditional NLG meta-evaluation approaches, such as issues in handling human ratings and ambiguous selections of correlation measures, which undermine the effectiveness of meta-evaluation. In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability. In addition, we introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations. Furthermore, we conduct experiments with 16 representative LLMs as the evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.